Abstract

A classical problem in statistics is estimating the expected coverage of a sample, which has had appli- 
cations in gene expression, microbial ecology, optimization, and even numismatics. Here we consider 
a related extension of this problem to random samples of two discrete distributions. Specifically, we 
estimate what we call the dissimilarity probability of a sample, i.e., the probability of a draw from one 
distribution not being observed in k draws from another distribution. We show our estimator of dissimilarity
to be a U-statistic and a uniformly minimum variance unbiased estimator of dissimilarity over the
largest appropriate range of k. Furthermore, despite the non-Markovian nature of our estimator when 
applied sequentially over k, we show it converges uniformly in probability to the dissimilarity parameter, 
and we present criteria when it is approximately normally distributed and admits a consistent jackknife 
estimator of its variance. As proof of concept, we analyze V35 16S rRNA data to discern between various 
microbial environments. Other potential applications concern any situation where dissimilarity of two 
discrete distributions may be of interest. For instance, in SELEX experiments, each urn could represent 
a random RNA pool and each draw a possible solution to a particular binding site problem over that 
pool. The dissimilarity of these pools is then related to the probability of finding binding site solutions 
in one pool that are absent in the other. 

Introduction 

An inescapable problem in microbial ecology is that a sample from an environment typically does not observe
all species present in that environment. In [1], this problem has been recently linked to the concepts
of coverage probability (i.e. the probability that a member from the environment is represented in the 
sample) and the closely related discovery or unobserved probability (i.e. the probability that a previously 
unobserved species is seen with another random observation from that environment). The mathematical 
treatment of coverage is not limited, however, to microbial ecology and has found applications in varied 
contexts, including gene expression, microbial ecology, optimization, and even numismatics. 

The point estimation of coverage and discovery probability seems to have been first addressed by Turing
and Good [2] to help decipher the Enigma Code, and subsequent work has provided point predictors and
prediction intervals for these quantities under various assumptions [1], [3-5].

Following Robbins [6] and in more generality Starr [7], an unbiased estimator of the expected discovery
probability of a sample of size $n$ is

$$\sum_{k=1}^{r} \frac{\binom{r-1}{k-1}}{\binom{n+r}{k}}\, N(k, n+r), \tag{1}$$

where $N(k, n+r)$ is the number of species observed exactly $k$ times in a sample with replacement of
size $(n+r)$. Using the theory of U-statistics developed by Halmos [8], Clayton and Frees [9] show that
the above estimator is the uniformly minimum variance unbiased estimator (UMVUE) of the expected
discovery probability of a sample of size $n$ based on an enlarged sample of size $(n+r)$.
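
As an illustration only, the estimator in (1) can be sketched in a few lines of Python, assuming it has the Robbins-Starr form $\sum_{k=1}^{r} \binom{r-1}{k-1} \binom{n+r}{k}^{-1} N(k, n+r)$. The function name `starr_discovery_estimate` and the list-of-labels input are hypothetical choices made for this sketch.

```python
from collections import Counter
from math import comb


def starr_discovery_estimate(sample, r):
    """Sketch of estimator (1), assuming the Robbins-Starr form.

    `sample` is the enlarged sample of size n + r (a list of species labels);
    the target is the expected discovery probability of a sample of size n.
    """
    total = len(sample)  # this is n + r
    if not 1 <= r <= total:
        raise ValueError("need 1 <= r <= len(sample)")
    # N[k] = number of species observed exactly k times in the enlarged sample
    N = Counter(Counter(sample).values())
    return sum(comb(r - 1, k - 1) / comb(total, k) * N[k] for k in range(1, r + 1))
```

With $r = 1$ the sum reduces to $N(1, n+1)/(n+1)$, the proportion of singletons in the enlarged sample.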

A quantity analogous to the discovery probability of a sample from a single environment but in the 
context of two environments is dissimilarity, which we broadly define as the probability that a draw in one
environment is not represented in a random sample (of a given size) from a possibly different environment.
Estimating the dissimilarity of two microbial environments is therefore closely related to the problem
of assessing the species that are unique to each environment, and the concept of dissimilarity may find
applications to measure sample quality and allocate additional sampling resources, for example, for a more
robust and reliable estimation of the UniFrac distance [10, 11] between pairs of environments. Dissimilarity
may find applications in other and very different contexts. For instance, in SELEX experiments [12] (a
laboratory technique in which an initial pool of synthesized random RNA sequences is repeatedly screened
to yield a pool containing only sequences with given biological functions), the dissimilarity of two RNA
pools corresponds to the probability of finding binding site solutions in one pool that are absent in the
other.

In this manuscript, we study an estimator of dissimilarity probability similar to Robbins' and Starr's 
statistic for discovery probability. Our estimator is optimal among the appropriate class of unbiased 
statistics, while being approximately normally distributed in a general case. The variance of this statistic 
is estimated using a consistent jackknife. As proof of concept, we analyze samples of processed V35 16S
rRNA data from the Human Microbiome Project [13].



Probabilistic Formulation and Inference Problem 

To study dissimilarity probability, we use the mathematical model of a pair of urns, where each urn has 
an unknown composition of balls of different colors, and where there is no a priori knowledge of the 
contents of either urn. Information concerning the urn composition is inferred from repeated draws with 
replacement from that urn. 

In what follows, $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ are independent sequences of independent and identically
distributed (i.i.d.) discrete random variables with probability mass functions $\mathbb{P}_x$ and $\mathbb{P}_y$, respectively.
Without loss of generality we assume that $\mathbb{P}_x$ and $\mathbb{P}_y$ are supported over possibly infinite subsets of
$\mathbb{N} = \{1, 2, 3, \ldots\}$, and think of outcomes from these distributions as "colors": i.e. we speak of color-1,
color-2, etc. Let $I_x$ denote the set of colors $i$ such that $\mathbb{P}_x(i) > 0$, and similarly define $I_y$. Under this
perspective, $X_k$ denotes the color of the $k$-th ball drawn with replacement from urn-x. Similarly, $Y_k$ is the
color of the $k$-th ball drawn with replacement from urn-y. Note that based on our formulation, distinct
draws are always independent.

The mathematical analysis that follows was motivated by the problem of estimating the fraction of 
balls in urn-x with a color that is absent in urn-y. We can write this parameter as 

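$$\theta_{x,y}(\infty) := \sum_{i \in (I_x \setminus I_y)} \mathbb{P}_x(i) = \lim_{k \to \infty} \theta_{x,y}(k), \tag{2}$$

where

$$\theta_{x,y}(k) := \mathbb{P}\bigl(X_1 \notin \{Y_1, \ldots, Y_k\}\bigr). \tag{3}$$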

The parameter $\theta_{x,y}(\infty)$ measures the proportion of urn-x which is unique from urn-y. On the other hand,
$\theta_{x,y}(k)$ is a measure of the effectiveness of $k$ samples from urn-y to determine uniqueness in urn-x. This
motivates us to refer to the quantity in (2) as the dissimilarity of urn-x from urn-y, and to the quantity
in (3) as the average dissimilarity of urn-x relative to $k$ draws from urn-y. Note that these parameters
are in general asymmetric in the roles of the urns. In what follows, urns-x and -y are assumed fixed,
which motivates us to remove subscripts and write $\theta(k)$ instead of $\theta_{x,y}(k)$.
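
When both urn compositions are known, independence of the draws gives $\theta(k) = \sum_i \mathbb{P}_x(i)\,(1 - \mathbb{P}_y(i))^k$, which is easy to evaluate directly. The following is a minimal sketch under that assumption; the function names and the dictionary representation of the mass functions are hypothetical and serve only to illustrate definitions (2) and (3).

```python
def dissimilarity(px, py, k):
    """theta(k) = sum_i P_x(i) * (1 - P_y(i))**k for pmfs given as dicts."""
    return sum(p * (1.0 - py.get(color, 0.0)) ** k for color, p in px.items())


def dissimilarity_limit(px, py):
    """theta(infinity): mass that urn-x places on colors absent from urn-y."""
    return sum(p for color, p in px.items() if py.get(color, 0.0) == 0.0)
```

For example, with `px = {1: 0.5, 2: 0.5}` and `py = {2: 0.5, 3: 0.5}`, color 1 is never matched and color 2 is missed by a single draw from urn-y with probability 0.5.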

Unfortunately, one cannot estimate unbiasedly the dissimilarity of one urn from another based on 
finite samples, as stated in the following result. (See the Materials and Methods section for the proofs of 
all of our results.) 

Theorem 1. (No unbiased estimator of dissimilarity.) There is no unbiased estimator of $\theta(\infty)$ based on
finite samples from two arbitrary urns-x and -y.

Furthermore, estimating $\theta(\infty)$ accurately without further assumptions on the compositions of urns-x
and -y seems a difficult if not impossible task. For instance, arbitrarily small perturbations of urn-y are
likely to be unnoticed in a sample of a given size from this urn but may drastically affect the dissimilarity
of other urns from urn-y. To demonstrate this idea, consider a parameter $0 \le \epsilon \le 1$ and let $\mathbb{P}_x(1) := 1$,
$\mathbb{P}_y(1) := \epsilon$ and $\mathbb{P}_y(2) := (1 - \epsilon)$. If $\epsilon = 0$ then $\theta(\infty) = 1$ while, for each $\epsilon > 0$, $\theta(\infty) = 0$.

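In contrast with the above, for fixed $k$, $\theta(k)$ depends continuously on $(\mathbb{P}_x, \mathbb{P}_y)$, e.g. under the metric

$$d\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr) := \|\mathbb{P}_x - \mathbb{P}_{x'}\| + \|\mathbb{P}_y - \mathbb{P}_{y'}\|,$$

where $\|\nu\| := \sup_{A \subset \mathbb{N}} |\nu(A)| = \frac{1}{2} \sum_i |\nu(i)|$ denotes the total variation of a signed measure $\nu$ over $\mathbb{N}$ such
that $\nu(\mathbb{N}) = 0$. This is the case because

$$|\theta_{x,y}(k) - \theta_{x',y'}(k)| \le \sum_i |\mathbb{P}_x(i) - \mathbb{P}_{x'}(i)| + k \sum_i |\mathbb{P}_y(i) - \mathbb{P}_{y'}(i)| \le 2(k+1) \cdot d\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr).$$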



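The above implies that $\theta(k)$ is continuous with respect to any metric equivalent to $d$. Many such metrics
can be conceived. For instance, if $(\mathbb{P}_x^m \times \mathbb{P}_y^n)$ denotes the probability measure associated with $m$ samples
with replacement from urn-x that are independent of $n$ samples with replacement from urn-y then $\theta(k)$ is
also continuous with respect to any of the metrics $d_{m,n}\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr) := \|(\mathbb{P}_x^m \times \mathbb{P}_y^n) - (\mathbb{P}_{x'}^m \times \mathbb{P}_{y'}^n)\|$,
with $m, n \ge 1$, because

$$d\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr)/2 \le d_{m,n}\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr) \le \max\{m, n\} \cdot d\bigl((\mathbb{P}_x, \mathbb{P}_y), (\mathbb{P}_{x'}, \mathbb{P}_{y'})\bigr).$$

Because of the above considerations, we discourage the direct estimation of $\theta(\infty)$ and focus on the
problem of estimating $\theta(k)$ accurately.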



Results 

Consider a finite number of draws with replacement $X_1, \ldots, X_{n_x}$ and $Y_1, \ldots, Y_{n_y}$, from urn-x and urn-y,
respectively, where $n_x, n_y \ge 1$ are assumed fixed. Using this data we can estimate $\theta(k)$, for $k = 1 : n_y$,
via the estimator:

$$\hat\theta(k) := \frac{1}{n_x \binom{n_y}{k}} \sum_{j=0}^{n_y - k} \binom{n_y - j}{k}\, Q(j), \tag{4}$$

where

$$Q(j) := \text{number of indices } i = 1 : n_x \text{ such that color } X_i \text{ occurs } j \text{ times in } Y_1, \ldots, Y_{n_y}. \tag{5}$$

We refer to $Q(0), \ldots, Q(n_y)$ as the Q-statistics summarizing the data from both urns. Due to the well-known
relation $\sum_{i=1}^{n} i = n(n+1)/2$, at most $(1 + \sqrt{2 n_y})$ of these statistics are non-zero. This sparsity
may be exploited in the calculation of the right-hand side of (4) over a large range of $k$'s.
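
To make the computation concrete, here is a minimal Python sketch of the Q-statistics and of the estimator, assuming it has the form $\hat\theta(k) = \frac{1}{n_x \binom{n_y}{k}} \sum_{j=0}^{n_y-k} \binom{n_y-j}{k} Q(j)$ suggested by the proof of Lemma 7. The names `q_statistics` and `theta_hat` are hypothetical, and samples are assumed to be given as plain lists of color labels.

```python
from collections import Counter
from math import comb


def q_statistics(xs, ys):
    """Q[j] = number of indices i such that color xs[i] occurs j times in ys."""
    y_counts = Counter(ys)
    # Only the non-zero Q(j) are stored, which exploits their sparsity.
    return Counter(y_counts[x] for x in xs)


def theta_hat(xs, ys, k):
    """Estimate theta(k) from the two samples, for 1 <= k <= len(ys)."""
    n_x, n_y = len(xs), len(ys)
    if n_x < 1 or not 1 <= k <= n_y:
        raise ValueError("need n_x >= 1 and 1 <= k <= n_y")
    Q = q_statistics(xs, ys)
    total = sum(comb(n_y - j, k) * q for j, q in Q.items() if j <= n_y - k)
    return total / (n_x * comb(n_y, k))
```

Since the Q-statistics do not depend on $k$, they can be computed once and reused for every $k$ in a range; the sketch recomputes them on each call only for brevity.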

Our statistic in (4) is the U-statistic associated with the kernel $[\![X_1 \notin \{Y_1, \ldots, Y_k\}]\!]$, where $[\![\cdot]\!]$
is used to denote the indicator function of the event within the brackets (Iverson's bracket notation).
Following the approach by Halmos in [8], we can show that this U-statistic is optimal amongst the
unbiased estimators of $\theta(k)$ for $k = 1 : n_y$. We note that no additional samples from either urn are
necessary to estimate $\theta(k)$ unbiasedly over this range when $n_x \ge 1$. This contrasts with the estimator
in equation (1), which requires sample enlargement for unbiased estimation of discovery probability of a
sample of size $n$.

Theorem 2. (Minimum variance unbiased estimator.) If $n_x \ge 1$ and $n_y \ge k$ then $\hat\theta(k)$ is the unique
uniformly minimum variance unbiased estimator of $\theta(k)$. Further, no unbiased estimator of $\theta(k)$ exists
for $n_x = 0$ or $n_y < k$.

Our next result shows that $\hat\theta(k)$ converges uniformly in probability to $\theta(k)$ over the largest possible
range where unbiased estimation of the latter parameter is possible, despite the non-Markovian nature of
$\hat\theta(k)$ when applied sequentially over $k$. The result asserts that $\hat\theta(k)$ is likely to be a good approximation
of $\theta(k)$, uniformly for $k = 1 : n_y$, when $n_x$ and $n_y$ are large. The method of proof uses an approach by
Hoeffding for the exact calculation of the variance of a U-statistic.

Theorem 3. (Uniform convergence in probability.) Independently of how $n_x$ and $n_y$ tend to infinity, it
follows for each $\epsilon > 0$ that

$$\lim_{n_x, n_y \to \infty} \mathbb{P}\Bigl(\max_{k = 1 : n_y} |\hat\theta(k) - \theta(k)| > \epsilon\Bigr) = 0. \tag{6}$$

We may estimate the variance of $\hat\theta(k)$ for $k = 1 : n_y$ via a leave-one-out, also called delete-1, jackknife
estimator, using an approach studied by Efron and Stein [15] and Shao and Wu [16].



To account for variability in the x-data through a leave-one-out jackknife estimate, we require that
$n_x \ge 2$ and let $S_x^2(k)$ denote the resulting estimate of the variance attributable to the x-data. (7)
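
The leave-one-out idea for the x-data can be sketched as follows. This is a naive illustration that assumes the textbook delete-1 jackknife formula $\frac{n-1}{n} \sum_i (\hat\theta_{(-i)} - \bar\theta)^2$ and the estimator form $\hat\theta(k) = \frac{1}{n_x \binom{n_y}{k}} \sum_i \binom{n_y - j_i}{k}$, with $j_i$ the number of occurrences of color $X_i$ in the y-sample. It is not necessarily the closed form used in the text, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def theta_hat(xs, ys, k):
    """Assumed estimator form; comb(a, k) is 0 whenever a < k."""
    n_y = len(ys)
    y_counts = Counter(ys)
    total = sum(comb(n_y - y_counts[x], k) for x in xs)
    return total / (len(xs) * comb(n_y, k))


def jackknife_var_x(xs, ys, k):
    """Delete-1 jackknife variance of theta_hat(k), resampling x-data only.

    Naive O(n_x^2) version: recomputes the estimate with each x removed.
    """
    n_x = len(xs)
    if n_x < 2:
        raise ValueError("need at least two draws from urn-x")
    loo = [theta_hat(xs[:i] + xs[i + 1:], ys, k) for i in range(n_x)]
    mean = sum(loo) / n_x
    return (n_x - 1) / n_x * sum((t - mean) ** 2 for t in loo)
```

Under these assumptions, at $k = n_y$ the result reduces to $\hat\theta(n_y)(1 - \hat\theta(n_y))/(n_x - 1)$, because each leave-one-out estimate only depends on whether the removed color is absent from the y-sample.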

On the other hand, to account for variability in the y-data, consider for $i \ge 1$ and $j \ge 0$ the statistics

$$M(i, j) := \text{number of colors } c \text{ such that color } c \text{ occurs exactly } i \text{ times in } (X_1, \ldots, X_{n_x}) \text{ and } j \text{ times in } (Y_1, \ldots, Y_{n_y}). \tag{8}$$

Clearly, $\sum_i i \cdot M(i, j) = Q(j)$; in particular, the M-statistics are a refinement of the Q-statistics. Define
$S_y(n_y) := 0$ and, for $k < n_y$, define

Slik) := 'hL_Y,Y,JM{^,J)[^{c,Mk)-c,{k)) + ^ik)-e{k)), (9) 



where 



"4 \ ) 

riy—k — l 

9y{k) cdk)QU)- (11) 

Our estimator of the variance of $\hat\theta(k)$ is obtained by summing the variance attributable to the x-data
and the y-data and is given by

$$S^2(k) := S_x^2(k) + S_y^2(k), \tag{12}$$

for $k = 1 : n_y$; in particular, $S(k)$ is our jackknife estimate of the standard deviation of $\hat\theta(k)$.

To assess the quality of $S^2(k)$ as an estimate of the variance of $\hat\theta(k)$ and the asymptotic distribution of
the latter statistic, we require a few assumptions that rule out degenerate cases. The following conditions
are used in the remaining theorems in this section:

(a) $|I_x \cap I_y| < \infty$.

(b) there are at least two colors in $(I_x \cap I_y)$ that occur in different proportions in urn-y; in particular,
the conditional probability $\mathbb{P}_y(\cdot \mid I_x \cap I_y)$ is not a uniform distribution.

(c) urn-x contains at least one color that is absent in urn-y; in particular, $\theta(\infty) > 0$.

(d) $n_x$ and $n_y$ grow to infinity at a comparable rate i.e. $n_x = \Theta(n_y)$, which means that there exist
finite constants $c_1, c_2 > 0$ such that $c_1 n_y \le n_x \le c_2 n_y$, as $n_x, n_y$ tend to infinity.

Conditions (a)-(c) imply that $\hat\theta(k)$ has a strictly positive variance and that a projection random variable,
intermediate between $\hat\theta(k)$ and $\theta(k)$, also has a strictly positive variance. The idea of projection is
motivated by the analysis of Grams and Serfling in [17].

Condition (d) is technical and only used to show that the result in Theorem 5 holds for the largest
possible range of values of $k$, namely for $k = 1 : n_y$. See [18] for results with uniformity related to
Theorem 4 as well as uniformity results when condition (d) is not assumed.

Because the variance of $\hat\theta(k)$, from now on denoted $\mathbb{V}(\hat\theta(k))$, and its estimate $S^2(k)$ tend to zero as $n_x$
and $n_y$ increase, the unnormalized consistency result is unsatisfactory. As an alternative, we can show
that $S^2(k)$ is a consistent estimator relative to $\mathbb{V}(\hat\theta(k))$, as stated next.

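Theorem 4. (Asymptotic consistency of variance estimation.) If conditions (a)-(c) are satisfied then,
for each $k \ge 1$ and $\epsilon > 0$, it applies that

$$\lim_{n_x, n_y \to \infty} \mathbb{P}\left(\left|\frac{S^2(k)}{\mathbb{V}(\hat\theta(k))} - 1\right| > \epsilon\right) = 0. \tag{13}$$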



Finally, under conditions (a)-(d), we show that $\hat\theta(k)$ is asymptotically normally distributed for all
$k = 1 : n_y$ as $n_x$ and $n_y$ increase at a comparable rate.



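Theorem 5. (Asymptotic normality.) Let $Z \sim \mathcal{N}(0, 1)$ i.e. $Z$ has a standard normal distribution. If
conditions (a)-(d) are satisfied then

$$\lim_{n_x, n_y \to \infty} \max_{k = 1 : n_y} \left| \mathbb{P}\left(\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right) - \mathbb{P}(Z \le t) \right| = 0, \tag{14}$$

for all real numbers $t$.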

The non-trivial aspect of the above result is the asymptotic normality of $\hat\theta(k)$ when $k = \Theta(n_y)$, e.g.
$\hat\theta(n_y)$, as the results we have found in the literature [14, 19, 20] only guarantee the asymptotic normality
of our estimator of $\theta(k)$ for fixed $k$. We note that, due to Slutsky's theorem [21], it follows from (13) and
(14) that the ratio

$$\frac{\hat\theta(k) - \theta(k)}{S(k)}$$

has, for fixed $k$, approximately a standard normal distribution when $n_x$ and $n_y$ are large and of a
comparable order of magnitude.
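
One practical reading of this approximation is a normal-theory interval $\hat\theta(k) \pm z \cdot S(k)$ for $\theta(k)$. The sketch below is only an illustration: the function name `normal_interval` is hypothetical, and clipping the interval to $[0, 1]$ is a design choice made here because $\theta(k)$ is a probability, not something prescribed by the text.

```python
from statistics import NormalDist


def normal_interval(theta_hat_k, s_k, level=0.95):
    """Approximate confidence interval for theta(k) from theta_hat(k) and S(k)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    lo, hi = theta_hat_k - z * s_k, theta_hat_k + z * s_k
    # theta(k) is a probability, so the interval is clipped to [0, 1]
    return max(0.0, lo), min(1.0, hi)
```

The interval is only as reliable as the normal approximation and the jackknife estimate $S(k)$, both of which rest on the conditions stated above.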



Discussion 

As proof of concept, we use our estimators to analyze data from the Human Microbiome Project
(HMP) [13]. In particular, our samples are V35 16S rRNA data, processed by Qiime into an operational
taxonomic unit (OTU) count table format (see File S1 in Supporting Information). Each of the
266 samples analyzed has more than 5000 successfully identified bacteria (see File S2 in Supporting
Information). We sort these samples by the body location metadata describing the origin of the sample.
This sorting yields the assignments displayed in Table 1.

We present our estimates of $\theta(n_y)$ for all 266 · 265 possible sample comparisons in Figure 1, i.e., we
estimate the average dissimilarity of sample-x relative to the full sample-y. Due to (4), observe that
$\hat\theta(n_y) = Q(0)/n_x$. At the given sample sizes, we can differentiate four broad groups of environments:
stool, vagina, oral/throat and skin/nostril. We differentiate a larger proportion of oral/throat bacteria 
found in stool than stool bacteria found in the oral/throat environments. We may also differentiate the 
throat, gingival and saliva samples, but cannot reliably differentiate between tongue and throat samples 
or between the subgingival and supragingival plaques. On the other hand, the stool samples have larger 
proportions of unique bacteria relative to other stool samples of the same type, and vaginal samples also 
have this property. In contrast the skin/nostril samples have relatively few bacteria that are not identified 
in other skin/nostril samples. 

The above effects may be a property of the environments from which samples are taken, or an effect of
noise from inaccurate estimates due to sampling. To rule out the latter interpretation, we show estimates
of the standard deviation of $\hat\theta(n_y)$ based on the jackknife estimator $S^2(n_y)$ from (12) in Figure 2. As
$S_y(n_y)$ is zero, the error estimate is given by $S_x(n_y)$. We see from (7), with $k = n_y$, that

$$S(n_y) = \sqrt{\frac{\hat\theta(n_y)\,(1 - \hat\theta(n_y))}{n_x - 1}}.$$

Assuming a normal distribution and an accurate jackknife estimate of variance, $\theta(n_y)$ will be in the
interval $\hat\theta(n_y) \pm 0.01$ with at least approximately 95% confidence, for any choice of sample comparisons
in our data; in particular, on a linear scale, we expect at least 95% of the estimates in Figure 1 to be
accurate in at least the first two digits.

As we mentioned earlier, estimating $\theta(\infty)$ accurately is a difficult problem. We end this section with
two heuristics to assess how representative $\hat\theta(n_y)$ is of $\theta(\infty)$, when urn-y has at least two colors and at
least one color in common with urn-x. First, observe that:

$$\theta(k) = \theta(\infty) + \sum_{i \in (I_x \cap I_y)} \mathbb{P}_x(i)\,(1 - \mathbb{P}_y(i))^k. \tag{15}$$

In particular, $\theta(k)$ is a strictly concave-up and monotonically decreasing function of the real variable
$k \ge 0$. Hence, if $\theta(n_y)$ is close to the asymptotic value $\theta(\infty)$, then $\theta(n_y) - \theta(n_y - 1)$ should be of
small magnitude. We call the latter quantity the discrete derivative of $\theta(k)$ at $k = n_y$. Since we may
estimate the discrete derivative from our data, the following heuristic arises: relatively large values of
$|\hat\theta(n_y) - \hat\theta(n_y - 1)|$ are evidence that $\hat\theta(n_y)$ is not a good approximation of $\theta(\infty)$.
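
The estimated discrete derivative is cheap to compute. The sketch below assumes the estimator form $\hat\theta(k) = \frac{1}{n_x \binom{n_y}{k}} \sum_i \binom{n_y - j_i}{k}$, with $j_i$ the number of occurrences of color $X_i$ in the y-sample; under that assumption a short calculation gives $\hat\theta(n_y - 1) - \hat\theta(n_y) = Q(1)/(n_x n_y)$, where $Q(1)$ counts the x-draws whose color is a singleton in the y-sample. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def theta_hat(xs, ys, k):
    """Assumed estimator form; comb(a, k) is 0 whenever a < k."""
    n_y = len(ys)
    y_counts = Counter(ys)
    total = sum(comb(n_y - y_counts[x], k) for x in xs)
    return total / (len(xs) * comb(n_y, k))


def discrete_derivative(xs, ys):
    """|theta_hat(n_y) - theta_hat(n_y - 1)| computed from the definition."""
    n_y = len(ys)
    if n_y < 2:
        raise ValueError("need at least two draws from urn-y")
    return abs(theta_hat(xs, ys, n_y) - theta_hat(xs, ys, n_y - 1))


def discrete_derivative_closed_form(xs, ys):
    """Same quantity via Q(1) / (n_x * n_y), derived under the assumed form."""
    y_counts = Counter(ys)
    q1 = sum(1 for x in xs if y_counts[x] == 1)
    return q1 / (len(xs) * len(ys))
```

The closed form makes the heuristic easy to interpret: the estimate is large exactly when many x-draws match colors seen only once in the y-sample.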

Figure 3 shows the heat map of $|\hat\theta(n_y) - \hat\theta(n_y - 1)|$ for each pair of samples. These estimates are of order
10~^ for the majority of the comparisons, and spike to 10~^ for several sample-y of varied environment
types, when sample-x is associated with a skin or vaginal sample. In particular, further sampling effort
from environments associated with certain vaginal, oral or stool samples is likely to reveal bacteria
associated with broadly defined skin or vaginal environments.

Another heuristic may be more useful to assess how close $\hat\theta(n_y)$ is to $\theta(\infty)$, particularly when the
previous heuristic is inconclusive. As motivation, observe that $\theta(k) = \theta(\infty) + \Theta(\rho^k)$, because of the
identity in (15), where

$$\rho := 1 - \min_{i \in (I_x \cap I_y)} \mathbb{P}_y(i).$$

Furthermore, $\log(\theta(k-1) - \theta(k)) = k (\ln \rho) + c + o(1)$, where $c$ is a certain finite constant. We can
justify this approximation only when $\log(\hat\theta(k-1) - \hat\theta(k))$ is well approximated by a linear function
of $k$, in which case we let $\hat\rho$ denote the estimated value for $\rho$ obtained from the linear regression. Since
$0 \le \theta(n_y) - \theta(\infty) \le \rho^{n_y}$, the following more precise heuristic comes to light: $\hat\theta(n_y)$ is a good approximation
of $\theta(\infty)$ if the linear regression of $\log|\hat\theta(k-1) - \hat\theta(k)|$ for $k$ near $n_y$ gives a good fit, $S(n_y)$ is small relative
to $\hat\theta(n_y)$, and $\hat\rho^{n_y}$ is also small.
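
The regression step of this heuristic can be sketched with ordinary least squares in plain Python. The function name `decay_rate_fit`, its input convention, and the use of the largest absolute residual as the fit diagnostic are assumptions made for this illustration.

```python
from math import exp, log


def decay_rate_fit(thetas, k_start):
    """Fit log(theta(k-1) - theta(k)) ~ k * log(rho) + c by least squares.

    `thetas[i]` is an estimate of theta(k_start - 1 + i), so the first
    difference corresponds to k = k_start. Returns (rho_hat, max_abs_residual).
    """
    ks, logs = [], []
    for i in range(1, len(thetas)):
        diff = thetas[i - 1] - thetas[i]
        if diff <= 0.0:
            raise ValueError("non-positive difference: log-linear fit does not apply")
        ks.append(k_start - 1 + i)
        logs.append(log(diff))
    n = len(ks)
    if n < 2:
        raise ValueError("need at least two differences to fit a line")
    k_bar, l_bar = sum(ks) / n, sum(logs) / n
    sxx = sum((k - k_bar) ** 2 for k in ks)
    slope = sum((k - k_bar) * (l - l_bar) for k, l in zip(ks, logs)) / sxx
    intercept = l_bar - slope * k_bar
    max_resid = max(abs(l - (slope * k + intercept)) for k, l in zip(ks, logs))
    return exp(slope), max_resid
```

Given the fitted value, the quantity to inspect is `rho_hat ** n_y`: a good linear fit together with a small power suggests that $\hat\theta(n_y)$ is close to $\theta(\infty)$, while a power close to one leaves the question open.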

To fix ideas we have applied the above heuristic to three pairs of samples: (255, 176), (200, 139) and
(100, 10), with each ordered pair denoting urn-x and urn-y, respectively. As seen in Table 2, for these
three cases, $\hat\theta(n_y)$ is at least 14 times larger than $S(n_y)$; in particular, due to the asymptotic normality
of the latter statistic, an appropriate use of the heuristic is reduced to a good linear fit and a small $\hat\rho^{n_y}$
value. In all three cases, $\hat\rho$ was computed from the estimates $\hat\theta(k)$, with $k = 5001 : n_y$.

For the (255, 176)-pair, $\hat\rho^{n_y}$ and the regression error, measured as the largest absolute residual associated
with the best linear fit, are zero to machine precision, suggesting that $\hat\theta(n_y) = 0.9998$ is a good
approximation of $\theta(\infty)$. This is reinforced by the blue plot in Figure 4. On the other hand, for the
(200, 139)-pair, the regression error is small, suggesting that the linear approximation of $\log(\hat\theta(k-1) - \hat\theta(k))$
is good for $k = 5001 : n_y$. However, because $\hat\rho^{n_y} = 0.9997$, we cannot guarantee that $\hat\theta(n_y)$ is a good
approximation of $\theta(\infty)$. In fact, as seen in the red plot in Figure 4, $\hat\theta(k)$, with $k = 1 : n_y$, exposes a
steady and almost linear decay that suggests that $\theta(\infty)$ may be much smaller than $\hat\theta(n_y)$. Finally, for
the (100, 10)-pair, the regression error is large and the heuristic is therefore inconclusive. Due to the
green plot in Figure 4, the lack of fit indicates that the exponential rate of decay of $\theta(k)$ to $\theta(\infty)$ has not
yet been captured by the data from these urns. Note that the heuristic based on the discrete derivative
shows no evidence that $\hat\theta(n_y)$ is far from $\theta(\infty)$.

Materials and Methods 

Here we prove the theorems given in the Results section. The key idea to prove each theorem may be 
summarized as follows. 

To show Theorem 1, we identify pairs of urns for which unbiased estimation of $\theta(\infty)$ is impossible
for any statistic. To show Theorem 2, we exploit the diversity of possible urn distributions to show that
there are relatively few unbiased estimators of $\theta(k)$ and, in fact, there is a single unbiased estimator $\hat\theta(k)$
that is symmetric on the data. The uniqueness of the symmetric estimator is obtained via a completeness
argument: a symmetric statistic having expected value zero is shown to correspond to a polynomial
with identically zero coefficients, which themselves correspond to values returned by the statistic when
presented with specific data. The symmetric estimator is a U-statistic in that it corresponds to an average
of unbiased estimates of $\theta(k)$, based on all possible sub-samples of size 1 and $k$ from the samples of urn-x
and -y, respectively. As any asymmetric estimator has higher variance than a corresponding symmetric
estimator, the symmetric estimator must be the UMVUE.

To show Theorem 3, we use bounds on the variance of the U-statistic and show that, uniformly for
relatively small $k$, $\hat\theta(k)$ converges to $\theta(k)$ in the $L^2$-norm. In contrast, for relatively large values of $k$, we
exploit the monotonicity of $\theta(k)$ and $\hat\theta(k)$ to show uniform convergence.

Finally, Theorems 4 and 5 are shown using an approximation of $\hat\theta(k)$ by sums of i.i.d. random variables,
as well as results concerning the variance of both $\hat\theta(k)$ and its approximation. In particular, the approximation
satisfies the hypotheses of the Central Limit Theorem and Law of Large Numbers, which we use to
transfer these results to $\hat\theta(k)$.

In what follows, $\mathcal{D}$ denotes the set of all probability distributions that are finitely supported over $\mathbb{N}$.

Proof of Theorem 1. Consider in $\mathcal{D}$ probability distributions of the form $\mathbb{P}_x(1) = 1$, $\mathbb{P}_y(1) = u$ and
$\mathbb{P}_y(2) = (1 - u)$, where $0 \le u \le 1$ is a given parameter. Any statistic $h(\cdot)$ which takes as input $n_x$ draws
from urn-x and $n_y$ draws from urn-y has that $\mathbb{E}(h(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y}))$ is a polynomial of degree
at most $n_y$ in the variable $u$; in particular, it is a continuous function of $u$ over the interval $[0, 1]$. Since
$\theta(\infty) = [\![u = 0]\!]$ has a discontinuity at $u = 0$ over this interval, there exists no estimator of $\theta(\infty)$ that is
unbiased over pairs of distributions in $\mathcal{D}$. □

We use Lemmas 6-11 to first show Theorem 2. The method of proof of this theorem follows an
approach similar to the one used by Halmos [8] for single distributions, which we extend here naturally
to the setting of two distributions.

Our next result implies that no uniformly unbiased estimator of $\theta(k)$ is possible when using fewer than
one sample from urn-x or fewer than $k$ samples from urn-y.

Lemma 6. If $g(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ is unbiased for $\theta(k)$ for all $\mathbb{P}_x, \mathbb{P}_y \in \mathcal{D}$, then $m \ge 1$ and $n \ge k$.

Proof. Consider in $\mathcal{D}$ probability distributions of the form $\mathbb{P}_x(1) = u$, $\mathbb{P}_x(2) = (1 - u)$, $\mathbb{P}_y(1) = v$ and
$\mathbb{P}_y(2) = (1 - v)$, where $0 \le u, v \le 1$ are arbitrary real numbers. Clearly, $\mathbb{E}[g(X_1, \ldots, X_m, Y_1, \ldots, Y_n)]$ is
a linear combination of polynomials of degree $m$ in $u$ and $n$ in $v$ and, as a result, it is a polynomial of
degree at most $m$ in $u$ and $n$ in $v$. Since $\theta(k) = u(1 - v)^k + (1 - u)v^k$ has degree 1 in $u$ and $k$ in $v$, and
$g(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ is unbiased for $\theta(k)$, we conclude that $1 \le m$ and $k \le n$. □



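The form of $\hat\theta(k)$ given in equation (4) is convenient for computation but, for mathematical analysis,
we prefer its U-statistic form associated with the kernel function $(x, y_1, \ldots, y_k) \mapsto [\![x \notin \{y_1, \ldots, y_k\}]\!]$.
In what follows, $S_{k,n_y}$ denotes the set of all functions $\sigma : \{1, \ldots, k\} \to \{1, \ldots, n_y\}$ that are one-to-one.

Lemma 7.

$$\hat\theta(k) = \frac{1}{n_x\,|S_{k,n_y}|} \sum_{i=1}^{n_x} \sum_{\sigma \in S_{k,n_y}} [\![X_i \notin \{Y_{\sigma(1)}, \ldots, Y_{\sigma(k)}\}]\!], \tag{16}$$

where $|S_{k,n_y}| = k! \binom{n_y}{k}$.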



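Proof. Fix $1 \le i \le n_x$ and suppose that color $X_i$ occurs $j$ times in $Y_1, \ldots, Y_{n_y}$. If $j > (n_y - k)$ then any
sublist of size $k$ of $Y_1, \ldots, Y_{n_y}$ contains $X_i$, hence $[\![X_i \notin \{Y_{\sigma(1)}, \ldots, Y_{\sigma(k)}\}]\!] = 0$, for all $\sigma \in S_{k,n_y}$. On
the other hand, if $j \le (n_y - k)$ then $\sum_{\sigma \in S_{k,n_y}} [\![X_i \notin \{Y_{\sigma(1)}, \ldots, Y_{\sigma(k)}\}]\!] = k! \binom{n_y - j}{k}$. Since the rightmost
sum only depends on the number of times that color $X_i$ was observed in $Y_1, \ldots, Y_{n_y}$, we may use the
Q-statistics defined in equation (5) to rewrite:

$$\sum_{i=1}^{n_x} \sum_{\sigma \in S_{k,n_y}} [\![X_i \notin \{Y_{\sigma(1)}, \ldots, Y_{\sigma(k)}\}]\!] = k! \sum_{j=0}^{n_y - k} \binom{n_y - j}{k}\, Q(j).$$

Up to the normalizing factor $1/(n_x |S_{k,n_y}|)$, the right-hand side above corresponds to the definition of $\hat\theta(k)$ given in equation (4). □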
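
The identity in Lemma 7 can be checked by brute force on small samples. The sketch below uses exact rational arithmetic and assumes the closed form $\hat\theta(k) = \frac{1}{n_x \binom{n_y}{k}} \sum_{j=0}^{n_y-k} \binom{n_y-j}{k} Q(j)$; it averages the kernel over unordered $k$-subsets of the y-sample, which is equivalent to averaging over one-to-one maps because each subset is counted $k!$ times in both numerator and denominator. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb


def theta_hat_closed_form(xs, ys, k):
    """Closed form in terms of the Q-statistics (assumed form)."""
    n_x, n_y = len(xs), len(ys)
    y_counts = Counter(ys)
    Q = Counter(y_counts[x] for x in xs)
    num = sum(comb(n_y - j, k) * q for j, q in Q.items() if j <= n_y - k)
    return Fraction(num, n_x * comb(n_y, k))


def theta_hat_u_statistic(xs, ys, k):
    """Average of the kernel over every x_i and every k-subset of positions in ys."""
    hits = sum(1 for x in xs for sub in combinations(ys, k) if x not in sub)
    return Fraction(hits, len(xs) * comb(len(ys), k))
```

The brute-force version costs on the order of $n_x \binom{n_y}{k}$ kernel evaluations, which is why the Q-statistic form is preferable for computation.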
In what follows, we say that a function $f : \mathbb{N}^{n_x + n_y} \to \mathbb{R}$ is $(n_x, n_y)$-symmetric when

$$f(x_1, \ldots, x_{n_x}; y_1, \ldots, y_{n_y}) = f(x_{\sigma(1)}, \ldots, x_{\sigma(n_x)}; y_{\sigma'(1)}, \ldots, y_{\sigma'(n_y)}),$$

for all $x_1, \ldots, x_{n_x}, y_1, \ldots, y_{n_y} \in \mathbb{N}$ and permutations $\sigma$ and $\sigma'$ of $1, \ldots, n_x$ and $1, \ldots, n_y$, respectively.
Alternatively, $f$ is $(n_x, n_y)$-symmetric if and only if it may be regarded as a function of $(x_{(1 \ldots n_x)}, y_{(1 \ldots n_y)})$,
where $x_{(1 \ldots n_x)}$ and $y_{(1 \ldots n_y)}$ correspond to the order statistics $x_{(1)}, \ldots, x_{(n_x)}$ and $y_{(1)}, \ldots, y_{(n_y)}$, respectively.
Accordingly, a statistic of $(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y})$ is called $(n_x, n_y)$-symmetric when it may
be represented in the form $f(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y})$, for some $(n_x, n_y)$-symmetric function $f$. It is
immediate from Lemma 7 that $\hat\theta(k)$ is $(n_x, n_y)$-symmetric.

The next result asserts that the variance of any non-symmetric unbiased estimator of 9{k) may be 
reduced by a corresponding symmetric unbiased estimator. The proof is based on the well-known fact 
that conditioning preserves the mean of a statistic and cannot increase its variance. 

Lemma 8. An asymmetric unbiased estimator of $\theta(k)$ that is square-integrable has a strictly larger
variance than a corresponding $(n_x, n_y)$-symmetric unbiased estimator.

Proof. Let $\mathcal{F}$ denote the sigma-field generated by the random vector $(X_{(1 \ldots n_x)}, Y_{(1 \ldots n_y)})$ and suppose that
the statistic $T = f(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y})$ is unbiased for $\theta(k)$ and square-integrable. In particular,
$U = \mathbb{E}[T \mid \mathcal{F}]$ is a well-defined statistic and there is an $(n_x, n_y)$-symmetric function $g : \mathbb{N}^{n_x + n_y} \to \mathbb{R}$ such
that $U = g(X_1, \ldots, X_{n_x}; Y_1, \ldots, Y_{n_y})$. Clearly, $U$ is unbiased for $\theta(k)$ and $(n_x, n_y)$-symmetric. Since
$\mathbb{E}(T^2) < +\infty$, Jensen's inequality for conditional expectations [22] implies that $\mathbb{E}(U^2) \le \mathbb{E}(T^2)$, with
equality if and only if $T$ is $(n_x, n_y)$-symmetric. □

Since $\hat\theta(k)$ is $(n_x, n_y)$-symmetric and bounded, the above lemma implies that if a UMVUE for $\theta(k)$
exists then it must be $(n_x, n_y)$-symmetric. Next, we show that there is a unique symmetric and unbiased
estimator of $\theta(k)$, which immediately implies that $\hat\theta(k)$ is the UMVUE.

In what follows, $k_1, k_2 \ge 0$ denote integers. We say that a polynomial $Q(u_1, \ldots, u_m; v_1, \ldots, v_n)$ is
$(k_1, k_2)$-homogeneous when it is a linear combination of polynomials of the form $\prod_{i=1}^{m} u_i^{\alpha_i} \prod_{j=1}^{n} v_j^{\beta_j}$, with
$\sum_{i=1}^{m} \alpha_i = k_1$ and $\sum_{j=1}^{n} \beta_j = k_2$. Furthermore, we say that $Q$ satisfies the partial vanishing condition if
$Q(u_1, \ldots, u_m; v_1, \ldots, v_n) = 0$ whenever $u_1, \ldots, u_m, v_1, \ldots, v_n \ge 0$, $\sum_{i=1}^{m} u_i = 1$ and $\sum_{i=1}^{n} v_i = 1$.

The next lemma is an intermediate step to show that a $(k_1, k_2)$-homogeneous polynomial which satisfies
the partial vanishing condition is the zero polynomial, which is shown in Lemma 10.

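Lemma 9. If $Q$ is a $(k_1, k_2)$-homogeneous polynomial in the real variables $u_1, \ldots, u_m, v_1, \ldots, v_n$, with
$m, n \ge 1$, that satisfies the partial vanishing condition, then $Q(u_1, \ldots, u_m; v_1, \ldots, v_n) = 0$ whenever
$u_1, \ldots, u_m, v_1, \ldots, v_n \ge 0$, $\sum_{i=1}^{m} u_i > 0$ and $\sum_{i=1}^{n} v_i > 0$.

Proof. Fix $u_1, \ldots, u_m, v_1, \ldots, v_n \ge 0$ such that $\sum_{i=1}^{m} u_i > 0$ and $\sum_{i=1}^{n} v_i > 0$, and observe that

$$Q(u_1, \ldots, u_m; v_1, \ldots, v_n) = \Bigl(\sum_{i=1}^{m} u_i\Bigr)^{k_1} \Bigl(\sum_{i=1}^{n} v_i\Bigr)^{k_2} Q\left(\frac{u_1}{\sum_i u_i}, \ldots, \frac{u_m}{\sum_i u_i}; \frac{v_1}{\sum_i v_i}, \ldots, \frac{v_n}{\sum_i v_i}\right),$$

because $Q$ is a $(k_1, k_2)$-homogeneous polynomial. Notice now that the right-hand side above is zero
because $Q$ satisfies the partial vanishing condition. □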

Lemma 10. Let $Q$ be a $(k_1, k_2)$-homogeneous polynomial in the real variables $u_1, \ldots, u_m, v_1, \ldots, v_n$, with
$m, n \ge 1$. If $Q$ satisfies the partial vanishing condition then $Q = 0$ identically.

Proof. We prove the lemma using structural induction on $(m, n)$ for all $k_1, k_2 \ge 0$.

If $m = n = 1$ then a $(k_1, k_2)$-homogeneous polynomial $Q(u_1, v_1)$ must be of the form $c\, u_1^{k_1} v_1^{k_2}$, for an
appropriate constant $c$. As such a polynomial satisfies the partial vanishing condition only when $c = 0$,
the base case for induction is established.

Next, consider a $(k_1, k_2)$-homogeneous polynomial $Q(u_1, \ldots, u_m; v_1, \ldots, v_n, v_{n+1})$, with $m, n \ge 1$, that
satisfies the partial vanishing condition, and let $d$ denote its degree with respect to the variable $v_{n+1}$. In
particular, there are polynomials $Q_0, \ldots, Q_d$ in the variables $u_1, \ldots, u_m, v_1, \ldots, v_n$ such that

$$Q(u_1, \ldots, u_m; v_1, \ldots, v_n, v_{n+1}) = \sum_{i=0}^{d} Q_i(u_1, \ldots, u_m; v_1, \ldots, v_n)\, v_{n+1}^{i}.$$

Now fix $u_1, \ldots, u_m, v_1, \ldots, v_n \ge 0$ such that $\sum_{i=1}^{m} u_i > 0$ and $\sum_{i=1}^{n} v_i > 0$. Because $Q$ satisfies the partial
vanishing condition, Lemma 9 implies that $\sum_{i=0}^{d} Q_i(u_1, \ldots, u_m; v_1, \ldots, v_n)\, v_{n+1}^{i} = 0$ for all $v_{n+1} \ge 0$. In
particular, for each $i$, $Q_i(u_1, \ldots, u_m; v_1, \ldots, v_n) = 0$ whenever $u_1, \ldots, u_m, v_1, \ldots, v_n \ge 0$, $\sum_{i=1}^{m} u_i > 0$ and
$\sum_{i=1}^{n} v_i > 0$. Thus each $Q_i$ satisfies the partial vanishing condition. Since $Q_i$ is a $(k_1, k_2 - i)$-homogeneous
polynomial, the inductive hypothesis implies that $Q_i = 0$ identically and hence $Q = 0$ identically. The
same argument shows that if $Q(u_1, \ldots, u_m, u_{m+1}; v_1, \ldots, v_n)$, with $m, n \ge 1$, is a $(k_1, k_2)$-homogeneous
polynomial that satisfies the partial vanishing condition then $Q = 0$ identically, completing the inductive
proof of the lemma. □

Our final result before proving Theorem 2 implies that $\theta(k)$ cannot admit more than one symmetric
and unbiased estimator. Its proof depends on the variety of distributions in $\mathcal{D}$, and uses the requirement
that our estimator must be unbiased for any pair of distributions chosen from $\mathcal{D}$.

Lemma 11. If $f$ is an $(n_x, n_y)$-symmetric function such that $\mathbb{E}[f(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y})] = 0$, for all
$\mathbb{P}_x, \mathbb{P}_y \in \mathcal{D}$, then $f = 0$ identically.

Proof. Consider a point $z = (x_1, \ldots, x_{n_x}, y_1, \ldots, y_{n_y}) \in \mathbb{N}^{n_x + n_y}$ and define $m_1$ and $m_2$ as the cardinalities
of the sets $\{x_1, \ldots, x_{n_x}\}$ and $\{y_1, \ldots, y_{n_y}\}$, respectively. Furthermore, let $x'_1, \ldots, x'_{m_1}$ denote the distinct
elements in the set $\{x_1, \ldots, x_{n_x}\}$ and define $m_{1,i}$ to be the number of times that $x'_i$ appears among
$x_1, \ldots, x_{n_x}$. Furthermore, let $\mathbb{P}_x \in \mathcal{D}$ be a probability distribution such that $\mathbb{P}_x(\{x'_1, \ldots, x'_{m_1}\}) = 1$ and define
$p_{1,i} := \mathbb{P}_x(x'_i)$. In a completely analogous manner define $y'_1, \ldots, y'_{m_2}$, $m_{2,j}$, $\mathbb{P}_y$ and $p_{2,j}$.

Notice that $\mathbb{E}[f(X_1, \ldots, X_{n_x}, Y_1, \ldots, Y_{n_y})]$ is a polynomial in the real variables $p_{1,1}, \ldots, p_{1,m_1}, p_{2,1}, \ldots, p_{2,m_2}$ that satisfies
the hypothesis of Lemma 10; in particular, this polynomial is identically zero. However, because $f$ is
$(n_x, n_y)$-symmetric, the coefficient of $\prod_{i=1}^{m_1} p_{1,i}^{m_{1,i}} \prod_{j=1}^{m_2} p_{2,j}^{m_{2,j}}$ in this polynomial is

$$\binom{n_x}{m_{1,1}; \ldots; m_{1,m_1}} \binom{n_y}{m_{2,1}; \ldots; m_{2,m_2}} f(z),$$

implying that $f(z) = 0$. □

Proof of Theorem 2. From Lemma 8, as we mentioned already, if the UMVUE for $\theta(k)$ exists then it must be $(n_x, n_y)$-symmetric. Suppose there are two $(n_x, n_y)$-symmetric functions such that $f(X_1,\ldots,X_{n_x}; Y_1,\ldots,Y_{n_y})$ and $g(X_1,\ldots,X_{n_x}; Y_1,\ldots,Y_{n_y})$ are unbiased for $\theta(k)$. Applying Lemma 11 to $(f - g)$ shows that $f = g$, and $\theta(k)$ therefore admits a unique symmetric and unbiased estimator. From Lemma 7, $\hat\theta(k)$ is $(n_x, n_y)$-symmetric and unbiased for $\theta(k)$; hence it is the UMVUE for $\theta(k)$. From Lemma 6, it follows that no unbiased estimator of $\theta(k)$ exists for $n_x = 0$ or $n_y < k$. □

Our next goal is to show Theorem 3, for which we first prove Lemmas 12 and 13. We note that the latter lemma applies in a much more general context than our treatment of dissimilarity.

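Lemma 12. If, for each $n \ge 1$, $k_n \ge 1$ is an integer such that $k_n^2 = o(n)$ then

$$\frac{\binom{k}{1}\binom{n-k}{k-1}}{\binom{n}{k}} = \frac{k^2}{n} + O\!\left(\frac{k^4}{n^2}\right); \tag{17}$$

$$\sum_{j=2}^{k} \frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{n}{k}} = O\!\left(\frac{k^4}{n^2}\right); \tag{18}$$

uniformly for $k = 1:k_n$ as $n \to \infty$.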

Proof. First observe that for all $n$ sufficiently large and $k = 1:k_n$, it applies that

$$\frac{\binom{k}{1}\binom{n-k}{k-1}}{\binom{n}{k}} = \frac{k^2}{n}\prod_{i=0}^{k-2}\left(1-\frac{k-1}{n-1-i}\right) = \frac{k^2}{n}\exp\left\{\sum_{i=0}^{k-2}\log\left(1-\frac{k-1}{n-1-i}\right)\right\}.$$

Note that $-x/(1-x) \le \log(1-x) \le -x$, for all $0 \le x < 1$. As a result, we may bound the exponential factor on the right-hand side above as follows:

$$e^{-(k-1)^2/(n-2k+2)} \le \exp\left\{\sum_{i=0}^{k-2}\log\left(1-\frac{k-1}{n-1-i}\right)\right\} \le e^{-(k-1)^2/(n-1)}.$$

Since $e^{-(k-1)^2/(n-2k+2)} = 1 + O(k^2/n)$ and $e^{-(k-1)^2/(n-1)} = 1 + O(k^2/n)$, uniformly for all $k = 1:k_n$ as $n \to \infty$, (17) follows.



To show (18), first note the combinatorial identity

$$\sum_{j=0}^{k} \frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{n}{k}} = 1. \tag{19}$$

Proceeding in an analogous manner as we did to show (17), we see now that the term associated with the index $j = 0$ in the above summation satisfies that

$$e^{-k^2/(n-2k+1)} \le \frac{\binom{n-k}{k}}{\binom{n}{k}} \le e^{-k^2/n},$$

for all $n$ sufficiently large and $k = 1:k_n$. Since $e^{-k^2/(n-2k+1)} = 1 - k^2/n + O(k^4/n^2)$ and $e^{-k^2/n} = 1 - k^2/n + O(k^4/n^2)$, the above inequalities together with (17) and (19) establish (18). □
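As a numerical sanity check of Lemma 12 (not part of the original argument), the following Python sketch evaluates the terms $\binom{k}{j}\binom{n-k}{k-j}/\binom{n}{k}$ exactly and compares them with the rates $k^2/n$ and $k^4/n^2$. The function names `overlap_term` and `lemma12_ratios`, and the values of $n$ and $k$, are illustrative assumptions.

```python
from math import comb


def overlap_term(n: int, k: int, j: int) -> float:
    """Return C(k, j) * C(n - k, k - j) / C(n, k), the j-th term in (19)."""
    return comb(k, j) * comb(n - k, k - j) / comb(n, k)


def lemma12_ratios(n: int, k: int):
    """Return (j = 1 term) / (k^2/n), (sum over j >= 2) / (k^4/n^2), and the full sum."""
    one_overlap = overlap_term(n, k, 1)
    tail = sum(overlap_term(n, k, j) for j in range(2, k + 1))
    total = sum(overlap_term(n, k, j) for j in range(0, k + 1))
    return one_overlap / (k**2 / n), tail / (k**4 / n**2), total


if __name__ == "__main__":
    k = 5
    for n in (10**3, 10**4, 10**5):
        first, tail, total = lemma12_ratios(n, k)
        # first should approach 1, tail should stay bounded, total should equal 1
        print(n, first, tail, total)
```

For fixed $k$, the first ratio approaches 1 and the second stays bounded as $n$ grows, which is the behaviour (17) and (18) describe.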

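Lemma 13. Define $\lambda(k) := \mathbb{E}(h(X_1,Y_1,\ldots,Y_k))$, where $h(x_1,y_1,\ldots,y_k)$ is a bounded $(1,k)$-symmetric function, and let

$$\hat\lambda(k) := \frac{1}{n_x\,|S_{k,n_y}|} \sum_{i=1}^{n_x} \sum_{\sigma \in S_{k,n_y}} h(X_i, Y_{\sigma(1)},\ldots,Y_{\sigma(k)})$$

be the U-statistic of $\lambda(k)$ associated with $n_x$ draws from urn-x and $n_y$ draws from urn-y; in particular, $\mathbb{E}(\hat\lambda(k)) = \lambda(k)$. Furthermore, assume that

(i) $0 \le h \le 1$,

(ii) there is a function $f : I_x \to [0,1]$ such that $\lim_{k\to\infty} h(X_1,Y_1,\ldots,Y_k) = f(X_1)$ almost surely,

(iii) $\hat\lambda(k) \ge \hat\lambda(k+1)$; in particular, $\lambda(k) \ge \lambda(k+1)$.

Under the above assumptions, it follows that

$$\lim_{n_x,n_y\to\infty} \mathbb{E}\left(\max_{k=1:n_y} \left|\hat\lambda(k) - \lambda(k)\right|^2\right) = 0.$$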



Proof. Define $k_n := 1 + \min\{\lfloor \sqrt{n_x} \rfloor, \lfloor \log(n_y) \rfloor\}$; in particular, $1 \le k_n \le n_y$ and $k_n \to \infty$, $k_n = o(n_x)$ and $k_n^p = o(n_y)$, for any $p > 0$, as $n_x, n_y \to \infty$. The proof of the lemma is reduced to showing that



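$$\lim_{n_x,n_y\to\infty} \mathbb{E}\left(\max_{k=k_n:n_y} |\hat\lambda(k) - \lambda(k)|^2\right) = 0; \tag{20}$$

$$\lim_{n_x,n_y\to\infty} \mathbb{E}\left(\max_{k=1:k_n} |\hat\lambda(k) - \lambda(k)|^2\right) = 0. \tag{21}$$

Next, we compute the variance of $\hat\lambda(k)$ following an approach similar to Hoeffding fU. Because $h$ is $(1,k)$-symmetric, a tedious yet standard calculation shows that

$$\mathbb{V}(\hat\lambda(k)) = \frac{\sum_{j=0}^{k} \binom{k}{j}\binom{n_y-k}{k-j}\left((n_x-1)\,\xi_{0,j}(k) + \xi_{1,j}(k)\right)}{n_x \binom{n_y}{k}}, \tag{22}$$

where

$$\xi_{0,j}(k) := \mathbb{V}\big(\mathbb{E}(h(X_1,Y_1,\ldots,Y_k) \mid Y_1,\ldots,Y_j)\big); \tag{23}$$

$$\xi_{1,j}(k) := \mathbb{V}\big(\mathbb{E}(h(X_1,Y_1,\ldots,Y_k) \mid X_1,Y_1,\ldots,Y_j)\big). \tag{24}$$

Clearly, $\xi_{i,j}(k) \le 1$. On the other hand, if $W$ is any random variable with finite expectation and $\mathcal{F}_1 \subset \mathcal{F}_2$ are sigma-fields then $\mathbb{V}(\mathbb{E}(W \mid \mathcal{F}_1)) \le \mathbb{V}(\mathbb{E}(W \mid \mathcal{F}_2))$, due to well-known properties of conditional expectations [22]. In particular, for each $0 \le j \le k$, we have that

$$\xi_{0,j}(k) \le \xi_{0,k}(k), \quad\text{and}\quad \xi_{1,j}(k) \le \xi_{1,k}(k). \tag{25}$$

Consequently, (22) implies that

$$\mathbb{V}(\hat\lambda(k)) \le \xi_{0,k}(k) + \frac{1}{n_x}. \tag{26}$$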



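We claim that

$$\lim_{k\to\infty} \xi_{0,k}(k) = 0. \tag{27}$$

Indeed, using an argument similar to the one above, we find that

$$\xi_{0,k}(k) = \mathbb{V}\big(\mathbb{E}(h(X_1,Y_1,\ldots,Y_k) - f(X_1) \mid Y_1,\ldots,Y_k)\big) \le \mathbb{V}\big(\mathbb{E}(h(X_1,Y_1,\ldots,Y_k) - f(X_1) \mid Y_1,\ldots,Y_k,X_1)\big) = \mathbb{V}\big(h(X_1,Y_1,\ldots,Y_k) - f(X_1)\big).$$

Due to assumptions (i)-(ii) and the Bounded Convergence Theorem, the right-hand side above tends to 0, and the claim follows.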



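It follows from (26) and (27) that

$$\lim_{n_x,k\to\infty} \mathbb{V}(\hat\lambda(k)) = 0.$$

Finally, because of assumption (iii),

$$\mathbb{E}\left(\max_{k=k_n:n_y} |\hat\lambda(k) - \lambda(k)|^2\right) \le 2\,|\lambda(k_n) - \lambda(n_y)|^2 + 2\,\mathbb{V}(\hat\lambda(k_n)) + 2\,\mathbb{V}(\hat\lambda(n_y)).$$

Since each term on the right-hand side above tends to zero as $n_x,n_y \to \infty$, (20) follows.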
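We now show (21). As $\xi_{i,j}(k) \le 1$ and $\xi_{0,0}(k) = 0$, it follows by (22) and Lemma 12 that

$$\mathbb{V}(\hat\lambda(k)) = O\!\left(\frac{1}{n_x} + \frac{k^2}{n_y} + \frac{k^4}{n_y^2}\right),$$

uniformly for $k = 1:k_n$ as $n_x, n_y \to \infty$. In particular,

$$\mathbb{E}\left(\max_{k=1:k_n} |\hat\lambda(k) - \lambda(k)|^2\right) \le \sum_{k=1}^{k_n} \mathbb{V}(\hat\lambda(k)) = O\!\left(\frac{k_n}{n_x} + \frac{k_n^3}{n_y} + \frac{k_n^5}{n_y^2}\right).$$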



Due to the definition of the coefficients $k_n$, the right-hand side above tends to zero, and (21) follows. □

Proof of Theorem 3. Note that $\theta(k) = \mathbb{E}(h(X_1,Y_1,\ldots,Y_k))$, with $h(x_1,y_1,\ldots,y_k) := [\![x_1 \notin \{y_1,\ldots,y_k\}]\!]$. We show that the kernel function $h$ and the U-statistics $\hat\theta(k)$ satisfy the hypotheses of Lemma 13. From this the theorem is immediate because $L^2$-convergence implies convergence in probability.

Clearly $h$ is $(1,k)$-symmetric and $0 \le h \le 1$, which shows assumption (i) in Lemma 13. On the other hand, due to the Law of Large Numbers, $\lim_{k\to\infty} h(X_1,Y_1,\ldots,Y_k) = [\![X_1 \notin I_y]\!]$ almost surely, from which assumption (ii) in the lemma also follows.

Finally, to show assumption (iii), recall that $S_{k,n}$ is the set of one-to-one functions from $\{1,\ldots,k\}$ into $\{1,\ldots,n\}$; in particular, $|S_{k+1,n_y}| = (n_y - k)\cdot|S_{k,n_y}|$. Now note that for each indicator of the form $[\![X_i \notin \{Y_{\sigma(1)},\ldots,Y_{\sigma(k)}\}]\!]$, with $\sigma \in S_{k+1,n_y}$, there are $(n_y - k)$ choices of $\sigma(k+1)$ outside the set $\{\sigma(1),\ldots,\sigma(k)\}$. Because $[\![X_i \notin \{Y_{\sigma(1)},\ldots,Y_{\sigma(k)}\}]\!] \ge [\![X_i \notin \{Y_{\sigma(1)},\ldots,Y_{\sigma(k+1)}\}]\!]$, it follows that $\hat\theta(k) \ge \hat\theta(k+1)$ for all $k = 1:(n_y - 1)$. This shows condition (iii) in Lemma 13 and Theorem 3 follows. □
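To make the U-statistic concrete, here is a minimal Python sketch that evaluates $\hat\theta(k)$ in two ways on toy data. The first averages the indicator over every x-draw and every one-to-one index map $\sigma$. The second is a counting shortcut which assumes that, for an x-draw whose color appears $c$ times among the $n_y$ y-draws, the fraction of maps avoiding that color is $\binom{n_y-c}{k}/\binom{n_y}{k}$. The function names and the data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations
from math import comb


def theta_hat_bruteforce(xs, ys, k):
    """Average of [[x not in {y_sigma(1), ..., y_sigma(k)}]] over all x and injective sigma."""
    index_maps = list(permutations(range(len(ys)), k))
    unseen = sum(
        all(ys[j] != x for j in sigma)
        for x in xs
        for sigma in index_maps
    )
    return unseen / (len(xs) * len(index_maps))


def theta_hat_counts(xs, ys, k):
    """Same quantity, using how often each x-color occurs among the y-draws."""
    y_counts = Counter(ys)  # colors absent from ys count as 0
    n_y = len(ys)
    return sum(comb(n_y - y_counts[x], k) for x in xs) / (len(xs) * comb(n_y, k))


if __name__ == "__main__":
    xs = ["a", "b", "c", "a"]
    ys = ["a", "b", "b", "d", "e"]
    for k in range(1, len(ys) + 1):
        print(k, theta_hat_bruteforce(xs, ys, k), theta_hat_counts(xs, ys, k))
```

On this data the two evaluations agree for every $k$ and the values do not increase with $k$, which is the monotonicity used for assumption (iii).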



Proof of equation ([t]). The jackknife estimate of the variance of $\hat\theta(k)$ obtained from removing a single x-data is, by definition, the quantity

$$S_x^2(k) := \frac{n_x-1}{n_x} \sum_{i=1}^{n_x} \left( \frac{1}{(n_x-1)\,|S_{k,n_y}|} \sum_{j \ne i} \sum_{\sigma \in S_{k,n_y}} [\![X_j \notin \{Y_{\sigma(1)},\ldots,Y_{\sigma(k)}\}]\!] - \hat\theta(k) \right)^2. \tag{28}$$

Note that removing a color from the x-data which would otherwise add to $Q(j)$ decrements this quantity by one unit. Let $Q_i$ denote the Q-statistics associated with the data when observation $X_i$ from urn-x is removed from the sample. Note that as each draw from urn-x contributes to exactly one $Q(j)$, $Q_i(j) = Q(j)$ for all $j$ except for some $j^*$ where $Q_i(j^*) = Q(j^*) - 1$. We have therefore that



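$$S_x^2(k) = \frac{n_x-1}{n_x} \sum_{i=1}^{n_x} \left( \frac{1}{n_x-1} \sum_{j=0}^{n_y} \frac{\binom{n_y-j}{k}}{\binom{n_y}{k}}\, Q_i(j) - \hat\theta(k) \right)^2.$$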




Since there are Q{j) draws from urn-x which contribute to Q{j), the above sum may be now rewritten 
in the form given in ([?]). □ 
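The leave-one-out computation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each x-draw is assumed to contribute the term $\binom{n_y-c}{k}/\binom{n_y}{k}$, where $c$ is the number of y-draws sharing its color, and all function names are hypothetical. The second function uses the algebraic identity $\hat\theta_{(-i)} - \hat\theta = (\hat\theta - t_i)/(n_x - 1)$, where $t_i$ is the term of the $i$-th x-draw.

```python
from collections import Counter
from math import comb


def per_draw_terms(xs, ys, k):
    """Term contributed by each x-draw: C(n_y - c, k) / C(n_y, k)."""
    y_counts = Counter(ys)
    n_y = len(ys)
    return [comb(n_y - y_counts[x], k) / comb(n_y, k) for x in xs]


def jackknife_x(xs, ys, k):
    """Jackknife variance estimate obtained by deleting one x-draw at a time."""
    terms = per_draw_terms(xs, ys, k)
    n_x = len(terms)
    total = sum(terms)
    theta_hat = total / n_x
    leave_one_out = [(total - t) / (n_x - 1) for t in terms]
    return (n_x - 1) / n_x * sum((v - theta_hat) ** 2 for v in leave_one_out)


def jackknife_x_closed_form(xs, ys, k):
    """Equivalent form: sum of squared deviations divided by n_x * (n_x - 1)."""
    terms = per_draw_terms(xs, ys, k)
    n_x = len(terms)
    theta_hat = sum(terms) / n_x
    return sum((t - theta_hat) ** 2 for t in terms) / (n_x * (n_x - 1))


if __name__ == "__main__":
    xs = ["a", "b", "c", "a"]
    ys = ["a", "b", "b", "d", "e"]
    print(jackknife_x(xs, ys, 1), jackknife_x_closed_form(xs, ys, 1))
```

Grouping the x-draws by the value of $c$ is what allows the sum over draws to be rewritten as a sum over the counts $Q(j)$.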

Proof of equation (joj). Similarly, $S_y^2(k)$ corresponds to the jackknife summed over each possible deletion of a single y-data, which is more precisely given by

$$S_y^2(k) := \frac{n_y-1}{n_y} \sum_{r=1}^{n_y} \left( \frac{1}{n_x\,|S_r|} \sum_{i=1}^{n_x} \sum_{\sigma \in S_r} [\![X_i \notin \{Y_{\sigma(1)},\ldots,Y_{\sigma(k)}\}]\!] - \hat\theta(k) \right)^2, \tag{29}$$

where $S_r$ is the set of one-to-one functions from $\{1,\ldots,k\}$ into $\{1,\ldots,n_y\} \setminus \{r\}$.

Recall that $M(i,j)$ is the number of colors seen $i$ times in draws from urn-x and $j$ times in draws from urn-y, giving that $\sum_i i\,M(i,j) = Q(j)$.

Fix $1 \le r \le n_y$ and suppose that $Y_r$ is of a color that contributes to $M(i^*,j^*)$, for some $i^*$ and $j^*$. Removing $Y_r$ from the data decrements $M(i^*,j^*)$ and increments $M(i^*,j^*-1)$ by one unit. Proceeding similarly as in the case for $S_x^2(k)$, if $M_r$ is used to denote the M-statistics when observation $Y_r$ is removed from sample-y, then

" r=l \ j=0 ' ' i=l 



rij, - 1 



, 1 I Sr I Tla; I Sr \ \ Sr 

r—1 \ J— 

^ [l* (c,;_i(fc) - C,.(fc)) + §y{k) - ^(fc) 



where $c_j(k)$ and $\hat\theta_y(k)$ are as defined in (10) and (11). Noting that for each $i$ there are $j$ draws from urn-y that contribute to $M(i,j)$, the stated form follows. □



In what follows, we specialize the coefficients in (23) and (24) to the kernel function of dissimilarity, $h(x_1,y_1,\ldots,y_k) := [\![x_1 \notin \{y_1,\ldots,y_k\}]\!]$. From now on, for each $j \ge 0$ and $k \ge 1$, define

$$\xi_{0,j}(k) := \mathbb{V}\big(\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid Y_1,\ldots,Y_j)\big); \tag{30}$$

$$\xi_{1,j}(k) := \mathbb{V}\big(\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid X_1,Y_1,\ldots,Y_j)\big). \tag{31}$$

Above it is understood that the sigma-field generated by $(Y_1,\ldots,Y_j)$ when $j = 0$ is $\{\emptyset, \Omega\}$; in particular, $\xi_{0,0}(k) = 0$, for all $k \ge 1$.

The following asymptotic properties of $\xi_{i,j}(k)$ are useful in the remaining proofs.

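Lemma 14. Assume that conditions (a)-(c) are satisfied and define $c := \min_{i \in I_x \cap I_y} \mathbb{P}_y(i)$. It follows that $0 < c < 1$ and $0 < \theta(\infty) < 1$. Furthermore

$$\xi_{1,k}(k) - \xi_{1,0}(k) = \Theta\big((1-c)^k\big); \tag{32}$$

$$\xi_{1,0}(k) = \theta(\infty)\,(1 - \theta(\infty)) + \Theta\big((1-c)^k\big); \tag{33}$$

$$\xi_{0,k}(k) = O\big((1-c)^k\big). \tag{34}$$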

Proof. Observe that conditions (a)-(b) imply that $0 < c < 1$. In addition, condition (b) implies that $\theta(\infty) < 1$, whereas condition (c) implies that $\theta(\infty) > 0$.

Next, consider the set

$$I^* := \{i \in (I_x \cap I_y) \text{ such that } \mathbb{P}_y(i) = c\},$$

i.e. $I^*$ is the set of rarest colors in urn-y which are also in urn-x. Also note that

$$\theta(k) = \theta(\infty) + \sum_{i \in (I_x \cap I_y)} \mathbb{P}_x(i)\,(1 - \mathbb{P}_y(i))^k. \tag{35}$$

As an intermediate step before showing (32), we prove that

$$\xi_{1,j}(k) = \theta(2k-j) - \theta^2(k). \tag{36}$$

For this, first observe that

$$\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid X_1,Y_1,\ldots,Y_j) = [\![X_1 \notin \{Y_1,\ldots,Y_j\}]\!]\,(1 - \mathbb{P}_y(X_1))^{k-j}.$$

Hence

$$\mathbb{E}\Big(\big(\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid X_1,Y_1,\ldots,Y_j)\big)^2\Big) = \mathbb{E}\big(\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_{2k-j}\} \mid X_1,Y_1,\ldots,Y_j)\big),$$

from which (36) now easily follows.
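To show (32) note that (36) implies

$$\xi_{1,k}(k) - \xi_{1,0}(k) = \sum_{i \in (I_x \cap I_y)} \mathbb{P}_x(i)\,(1-\mathbb{P}_y(i))^k \big(1 - (1-\mathbb{P}_y(i))^k\big) = (1-c)^k \sum_{i \in I^*} \mathbb{P}_x(i) + o\big((1-c)^k\big),$$

which establishes (32).

Now note that

$$\xi_{1,0}(k) = \theta(\infty) + \sum_{i \in (I_x \cap I_y)} \mathbb{P}_x(i)\,(1-\mathbb{P}_y(i))^{2k} - \Big(\theta(\infty) + \sum_{i \in (I_x \cap I_y)} \mathbb{P}_x(i)\,(1-\mathbb{P}_y(i))^{k}\Big)^2 = \theta(\infty) - \theta^2(\infty) - 2\,\theta(\infty)\,(1-c)^k \sum_{i \in I^*} \mathbb{P}_x(i) + o\big((1-c)^k\big),$$

which establishes (33).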

Next we show (34), which we note gives more precise information than (27). Consider the random variable $T$ defined as the smallest $n \ge 1$ such that $(I_x \cap I_y) \subset \{Y_1,\ldots,Y_n\}$. We may bound the probability of $T$ being large by $\mathbb{P}(T > k) \le n\,(1-c)^k$, where $n := |I_x \cap I_y|$ is finite because of condition (a). On the other hand, note that

$$\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid Y_1,\ldots,Y_k) = 1 - \mathbb{P}_x(\{Y_1,\ldots,Y_k\}).$$

Define $W_k := 1 - \mathbb{P}_x(\{Y_1,\ldots,Y_k\}) - \theta(k)$ and observe that, over the event $T \le k$, $W_k = \theta(\infty) - \theta(k)$. Since $|W_k| \le 1$, we obtain that

$$\xi_{0,k}(k) = \mathbb{E}(W_k^2 \mid T > k)\cdot\mathbb{P}(T > k) + \mathbb{E}(W_k^2 \mid T \le k)\cdot\mathbb{P}(T \le k) \le \mathbb{P}(T > k) + \mathbb{E}(W_k^2 \mid T \le k) \le n\,(1-c)^k + (\theta(\infty) - \theta(k))^2.$$

The identity in equation (34) is now a direct consequence of (35).

□
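The asymptotics in Lemma 14 can be illustrated numerically. The Python sketch below assumes the closed forms $\theta(k) = \sum_i \mathbb{P}_x(i)(1-\mathbb{P}_y(i))^k$ and $\xi_{1,j}(k) = \theta(2k-j) - \theta^2(k)$ from (35) and (36); the two toy urns and the function names are made up for illustration. For these urns, $I_x \cap I_y = \{a, b\}$, $c = 0.1$, $I^* = \{b\}$ and $\theta(\infty) = 0.2$.

```python
def theta(k, p_x, p_y):
    """theta(k): probability that a draw from urn-x is missed by k draws from urn-y."""
    return sum(p * (1.0 - p_y.get(color, 0.0)) ** k for color, p in p_x.items())


def xi_1(j, k, p_x, p_y):
    """xi_{1,j}(k) computed through the identity theta(2k - j) - theta(k)^2."""
    return theta(2 * k - j, p_x, p_y) - theta(k, p_x, p_y) ** 2


if __name__ == "__main__":
    p_x = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical urn-x
    p_y = {"a": 0.6, "b": 0.1, "d": 0.3}  # hypothetical urn-y
    c = 0.1  # smallest urn-y probability among the shared colors
    for k in (10, 50, 100):
        gap = xi_1(k, k, p_x, p_y) - xi_1(0, k, p_x, p_y)
        # gap / (1 - c)^k should approach P_x(b); xi_{1,0}(k) should approach 0.2 * 0.8
        print(k, gap / (1.0 - c) ** k, xi_1(0, k, p_x, p_y))
```

As $k$ grows, the rescaled gap settles near $\sum_{i \in I^*}\mathbb{P}_x(i)$ and $\xi_{1,0}(k)$ settles near $\theta(\infty)(1-\theta(\infty))$, in line with (32) and (33).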



Our next goal is to show Theorems 4 and 5. To do so we rely on the method of projection by Grams and Serfling [17]. This approach approximates $\hat\theta(k)$ by the random variable

$$\hat\theta_p(k) := \theta(k) + \sum_{i=1}^{n_x} \big(\mathbb{E}(\hat\theta(k) \mid X_i) - \theta(k)\big) + \sum_{j=1}^{n_y} \big(\mathbb{E}(\hat\theta(k) \mid Y_j) - \theta(k)\big).$$

The projection is the best approximation in terms of mean squared error to $\hat\theta(k)$ that is a linear combination of individual functions of each datapoint.

Under the stated conditions, $\hat\theta_p(k)$ is the sum of two independent sums of non-degenerate i.i.d. random variables and therefore satisfies the hypotheses of the classical central limit theorem. The variance of the projection is easier to analyze and estimate than that of the U-statistic directly, which is relevant in establishing consistency for the jackknife estimation of variance.

Let

$$R(k) := \hat\theta(k) - \hat\theta_p(k),$$

be the remainder of $\hat\theta(k)$ that is not accounted for by its projection. When $R(k)$ is small relative to $\hat\theta_p(k)$, $\hat\theta(k)$ is mostly explained by $\hat\theta_p(k)$ in relative terms.

The next lemma summarizes results about the asymptotic properties of $R(k)$, particularly in relation to the scale of $\hat\theta_p(k)$ as given by its variance.



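Lemma 15. We have that

$$\mathbb{V}(\hat\theta_p(k)) = n_x^{-1}\,\xi_{1,0}(k) + k^2 n_y^{-1}\,\xi_{0,1}(k); \tag{37}$$

$$\mathbb{E}(R^2(k)) = \mathbb{V}(\hat\theta(k)) - \mathbb{V}(\hat\theta_p(k)). \tag{38}$$

Under assumptions (a)-(c), for a fixed $k \ge 1$, we have that

$$\lim_{n_x,n_y\to\infty} \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} = 1. \tag{39}$$

Furthermore, under assumptions (a)-(d) we have that

$$\lim_{n_x,n_y\to\infty} \max_{k=1:n_y} \left| \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} - 1 \right| = 0; \tag{40}$$

$$\lim_{n_x,n_y\to\infty} \max_{k=1:n_y} \mathbb{P}\left[ \frac{|\hat\theta(k) - \hat\theta_p(k)|}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} > \epsilon \right] = 0; \tag{41}$$

for all $\epsilon > 0$.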



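Proof. A direct calculation from the form given in (16) gives that

$$\mathbb{E}(\hat\theta(k) \mid X_i) = n_x^{-1}\,\mathbb{P}(X_i \notin \{Y_1,\ldots,Y_k\} \mid X_i) + (1 - n_x^{-1})\,\theta(k); \tag{42}$$

$$\mathbb{V}(\mathbb{E}(\hat\theta(k) \mid X_i)) = n_x^{-2}\,\xi_{1,0}(k);$$

$$\mathbb{E}(\hat\theta(k) \mid Y_j) = k n_y^{-1}\,\mathbb{P}(X_1 \notin \{Y_j,\ldots,Y_{k+j-1}\} \mid Y_j) + (1 - k n_y^{-1})\,\theta(k); \tag{43}$$

$$\mathbb{V}(\mathbb{E}(\hat\theta(k) \mid Y_j)) = k^2 n_y^{-2}\,\xi_{0,1}(k).$$

As $\mathbb{V}(\hat\theta_p(k)) = n_x \mathbb{V}(\mathbb{E}(\hat\theta(k) \mid X_1)) + n_y \mathbb{V}(\mathbb{E}(\hat\theta(k) \mid Y_1))$, (37) follows.

To show (38), first observe that

$$\mathbb{E}(R^2(k)) = \mathbb{V}(R(k)) = \mathbb{V}(\hat\theta(k)) + \mathbb{V}(\hat\theta_p(k)) - 2\,\mathrm{Cov}(\hat\theta(k), \hat\theta_p(k)). \tag{44}$$

Next, using the definition of the projection, we obtain that

$$\mathrm{Cov}(\hat\theta(k), \hat\theta_p(k)) = \sum_{i=1}^{n_x} \mathrm{Cov}(\hat\theta(k), \mathbb{E}(\hat\theta(k) \mid X_i)) + \sum_{j=1}^{n_y} \mathrm{Cov}(\hat\theta(k), \mathbb{E}(\hat\theta(k) \mid Y_j)) = \sum_{i=1}^{n_x} \mathbb{V}(\mathbb{E}(\hat\theta(k) \mid X_i)) + \sum_{j=1}^{n_y} \mathbb{V}(\mathbb{E}(\hat\theta(k) \mid Y_j)) = \mathbb{V}(\hat\theta_p(k)),$$

from which (38) follows, due to the identity in (44). Note that the last identity implies that $\hat\theta_p(k)$ and $R(k)$ are uncorrelated.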



Before continuing, we note that (41) is a direct consequence of (38), (40) and Chebyshev's inequality [22]. To complete the proof of the lemma, it therefore remains to show (39) under conditions (a)-(c) and (40) under conditions (a)-(d). Indeed, if $b > 1$ and we let $k_n := \log_b(n_y)$ then due to the identities in (22) and (37) and Lemma 12 we obtain under (a)-(c) that

$$\mathbb{V}(\hat\theta(k)) = \mathbb{V}(\hat\theta_p(k)) + O\!\left(\frac{k^2}{n_x n_y} + \frac{k^4}{n_y^2}\right),$$

uniformly for all $k = 1:k_n$, as $n_x, n_y \to \infty$. Since $\mathbb{V}(\hat\theta_p(k)) > 0$ for all $k \ge 1$, we have thus shown (39). Furthermore, note that $\xi_{1,0}(k) > 0$ for all $k \ge 1$; in particular, due to (33) and conditions (a)-(d), we can assert that $\inf_{k \ge 1} \xi_{1,0}(k) > 0$. Since $\xi_{0,1}(k) \ge 0$, the above identity together with the one in (37) lets us conclude that

$$\max_{k=1:k_n} \left| \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} - 1 \right| = O\!\left(\frac{k_n^2}{n_y} + \frac{n_x k_n^4}{n_y^2}\right),$$

as $n_x, n_y \to \infty$. Because of condition (d), the big-O term above tends to 0. As a result:

$$\lim_{n_x,n_y\to\infty} \max_{k=1:k_n} \left| \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} - 1 \right| = 0. \tag{45}$$



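On the other hand, (38) implies that $\mathbb{V}(\hat\theta(k)) \ge \mathbb{V}(\hat\theta_p(k))$. Hence, using (19) and (25) to bound from above the variance of the U-statistic, we obtain

$$1 \le \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} \le \frac{\xi_{0,k}(k) + \xi_{1,k}(k)/n_x}{k^2\,\xi_{0,1}(k)/n_y + \xi_{1,0}(k)/n_x} \le \frac{\xi_{1,k}(k)}{\xi_{1,0}(k)} + n_x\,\frac{\xi_{0,k}(k)}{\xi_{1,0}(k)} = 1 + n_x \cdot O\big((1-c)^k\big),$$

as $k \to \infty$, where for the last identity we have used (32) and (34). Since $n_x = \Theta(n_y)$, it follows from the above identity that

$$\max_{k=k_n:n_y} \left| \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} - 1 \right| = O\big(n_y (1-c)^{k_n}\big) = O\big(n_y^{1+\log_b(1-c)}\big).$$

In particular, if the base $b$ in the logarithm is selected to satisfy that $1 < b < 1/(1-c)$, then

$$\lim_{n_x,n_y\to\infty} \max_{k=k_n:n_y} \left| \frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))} - 1 \right| = 0. \tag{46}$$

The identities in equations (45) and (46) show (40), which completes the proof of the lemma.

□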
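For small urns, the projection variance in (37) can be evaluated exactly. The Python sketch below assumes $\xi_{1,0}(k) = \theta(2k) - \theta^2(k)$, and computes $\xi_{0,1}(k)$ as the variance over $Y_1$ of $g(y) = \sum_{i \ne y} \mathbb{P}_x(i)(1-\mathbb{P}_y(i))^{k-1}$, which is the conditional probability $\mathbb{P}(X_1 \notin \{Y_1,\ldots,Y_k\} \mid Y_1 = y)$. The urns, sample sizes and function names are illustrative assumptions.

```python
def theta(k, p_x, p_y):
    """theta(k) = sum_i P_x(i) * (1 - P_y(i))^k."""
    return sum(p * (1.0 - p_y.get(color, 0.0)) ** k for color, p in p_x.items())


def xi_10(k, p_x, p_y):
    """Variance of P(X_1 not in {Y_1..Y_k} | X_1)."""
    return theta(2 * k, p_x, p_y) - theta(k, p_x, p_y) ** 2


def xi_01(k, p_x, p_y):
    """Variance of P(X_1 not in {Y_1..Y_k} | Y_1), by direct enumeration of Y_1."""
    def g(y):
        return sum(
            p * (1.0 - p_y.get(color, 0.0)) ** (k - 1)
            for color, p in p_x.items()
            if color != y
        )
    mean = sum(q * g(y) for y, q in p_y.items())
    second_moment = sum(q * g(y) ** 2 for y, q in p_y.items())
    return second_moment - mean ** 2


def projection_variance(k, n_x, n_y, p_x, p_y):
    """Right-hand side of (37)."""
    return xi_10(k, p_x, p_y) / n_x + k ** 2 * xi_01(k, p_x, p_y) / n_y


if __name__ == "__main__":
    p_x = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical urn-x
    p_y = {"a": 0.6, "b": 0.1, "d": 0.3}  # hypothetical urn-y
    for k in (1, 5, 20):
        print(k, xi_10(k, p_x, p_y), xi_01(k, p_x, p_y),
              projection_variance(k, 100, 200, p_x, p_y))
```

In this example $\xi_{0,1}(k)$ shrinks quickly with $k$, so the $\xi_{1,0}(k)/n_x$ term dominates the projection variance for larger $k$.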
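Proof of Theorem 5. For a fixed $k$, note that $\hat\theta_p(k)$ is the sum of two independent sums of non-degenerate i.i.d. random variables and thus,

$$\big(\hat\theta_p(k) - \theta(k)\big)\big/\sqrt{\mathbb{V}(\hat\theta_p(k))}$$

is asymptotically a standard Normal random variable as $n_x,n_y \to \infty$ by the classical Central Limit Theorem. We would like to show however that this convergence also applies if we let $k$ vary with $n_x$ and $n_y$. We do so using the Berry-Esseen inequality [23]. Motivated by this we define the random variables

$$X_i'(k) := \frac{\mathbb{E}(\hat\theta(k) \mid X_i) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}}; \qquad Y_j'(k) := \frac{\mathbb{E}(\hat\theta(k) \mid Y_j) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}}.$$

Note that $\mathbb{E}(X_i'(k)) = \mathbb{E}(Y_j'(k)) = 0$, and that

$$\sum_{i=1}^{n_x} \mathbb{E}(|X_i'(k)|^2) + \sum_{j=1}^{n_y} \mathbb{E}(|Y_j'(k)|^2) = 1.$$

We need to show that

$$\sum_{i=1}^{n_x} \mathbb{E}(|X_i'(k)|^3) + \sum_{j=1}^{n_y} \mathbb{E}(|Y_j'(k)|^3) = o(1), \tag{47}$$

uniformly for $k = 1:n_y$, as $n_x, n_y \to \infty$.

Note that from (42) and (43),

$$|\mathbb{E}(\hat\theta(k) \mid X_i) - \theta(k)|^3 = n_x^{-3}\,|\mathbb{P}(X_i \notin \{Y_1,\ldots,Y_k\} \mid X_i) - \theta(k)|^3;$$

$$|\mathbb{E}(\hat\theta(k) \mid Y_i) - \theta(k)|^3 = k^3 n_y^{-3}\,|\mathbb{P}(X_1 \notin \{Y_i,\ldots,Y_{k+i-1}\} \mid Y_i) - \theta(k)|^3.$$

Let

$$\eta_{1,0}(k) := \mathbb{E}\,|\mathbb{P}(X_i \notin \{Y_1,\ldots,Y_k\} \mid X_i) - \theta(k)|^3;$$

$$\eta_{0,1}(k) := \mathbb{E}\,|\mathbb{P}(X_1 \notin \{Y_i,\ldots,Y_{k+i-1}\} \mid Y_i) - \theta(k)|^3.$$

It follows from (37) that

$$\sum_{i=1}^{n_x} \mathbb{E}(|X_i'(k)|^3) + \sum_{j=1}^{n_y} \mathbb{E}(|Y_j'(k)|^3) = \frac{\eta_{1,0}(k)/n_x^2 + k^3 \eta_{0,1}(k)/n_y^2}{\mathbb{V}(\hat\theta_p(k))^{3/2}} = \frac{\eta_{1,0}(k)/n_x^2 + k^3 \eta_{0,1}(k)/n_y^2}{\big(\xi_{1,0}(k)/n_x + k^2 \xi_{0,1}(k)/n_y\big)^{3/2}}.$$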



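But note that $0 \le \eta_{0,1}(k) \le \xi_{0,1}(k)$. Since, according to (25) and Lemma 14, $\xi_{0,1}(k)$ decreases exponentially fast, we obtain

$$k^3\,\eta_{0,1}(k) = O(1),$$

uniformly for all $k = 1:n_y$, as $n_y \to \infty$. On the other hand, $0 \le \eta_{1,0}(k) \le \xi_{1,0}(k) \le 1$. Furthermore, (33) implies that $\inf_{k \ge 1} \xi_{1,0}(k) > 0$. Since $n_x = \Theta(n_y)$, for some finite constant $C > 0$ we find that

$$\sum_{i=1}^{n_x} \mathbb{E}(|X_i'(k)|^3) + \sum_{j=1}^{n_y} \mathbb{E}(|Y_j'(k)|^3) \le \frac{C/n_x^2}{\big(\inf_{k \ge 1} \xi_{1,0}(k)/n_x\big)^{3/2}} = O\big(n_x^{-1/2}\big),$$

which shows (47).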

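The above establishes convergence in distribution of $(\hat\theta_p(k) - \theta(k))/\sqrt{\mathbb{V}(\hat\theta_p(k))}$ to a standard normal random variable uniformly for $k = 1:n_y$, as $n_x, n_y \to \infty$. The end of the proof is an adaptation of the proof of Slutsky's Theorem [21]. Indeed, note that

$$\mathbb{P}\left[\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right] = \mathbb{P}\left[\frac{\hat\theta_p(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} + \frac{\hat\theta(k) - \hat\theta_p(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} \le t\,\sqrt{\frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))}}\right].$$

From this identity, it follows for any fixed $\epsilon > 0$ that

$$\mathbb{P}\left[\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right] \le \mathbb{P}\left[\frac{\hat\theta_p(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} \le t\,\sqrt{\frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))}} + \epsilon\right] + \mathbb{P}\left[\frac{|\hat\theta(k) - \hat\theta_p(k)|}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} > \epsilon\right]. \tag{48}$$

The first term on the right-hand side of the above inequality can be made as close to $\mathbb{P}[Z \le t + \epsilon]$ as wanted, uniformly for $k = 1:n_y$, as $n_x,n_y \to \infty$, because of (40). On the other hand, the second term tends to 0 uniformly for $k = 1:n_y$ because of (41). Letting $\epsilon \to 0^+$ shows that

$$\limsup_{n_x,n_y\to\infty} \max_{k=1:n_y} \mathbb{P}\left[\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right] \le \mathbb{P}[Z \le t].$$

Similarly, we have:

$$\mathbb{P}\left[\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right] \ge \mathbb{P}\left[\frac{\hat\theta_p(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} \le t\,\sqrt{\frac{\mathbb{V}(\hat\theta(k))}{\mathbb{V}(\hat\theta_p(k))}} - \epsilon\right] - \mathbb{P}\left[\frac{|\hat\theta(k) - \hat\theta_p(k)|}{\sqrt{\mathbb{V}(\hat\theta_p(k))}} > \epsilon\right],$$

and a similar argument as before shows now that

$$\liminf_{n_x,n_y\to\infty} \max_{k=1:n_y} \mathbb{P}\left[\frac{\hat\theta(k) - \theta(k)}{\sqrt{\mathbb{V}(\hat\theta(k))}} \le t\right] \ge \mathbb{P}[Z \le t],$$

which completes the proof of the theorem.

□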



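We finally show Theorem 4, for which we first show the following result.

Lemma 16. Let $S^i_{k,n}$ be the set of one-to-one functions from $\{1,\ldots,k\}$ into $\{1,\ldots,n\} \setminus \{i\}$. Consider the kernel $h(x_1,y_1,\ldots,y_k) := [\![x_1 \notin \{y_1,\ldots,y_k\}]\!]$, and define

$$\hat\theta_x^{i}(k) := \frac{1}{(n_x-1)\,|S_{k,n_y}|} \sum_{j \ne i} \sum_{\sigma \in S_{k,n_y}} h(X_j, Y_{\sigma(1)},\ldots,Y_{\sigma(k)}); \tag{49}$$

$$\hat\theta_x^{i'}(k) := \frac{1}{|S_{k,n_y}|} \sum_{\sigma \in S_{k,n_y}} h(X_i, Y_{\sigma(1)},\ldots,Y_{\sigma(k)}); \tag{50}$$

$$\hat\theta_y^{i}(k) := \frac{1}{n_x\,|S^i_{k,n_y}|} \sum_{j=1}^{n_x} \sum_{\sigma \in S^i_{k,n_y}} h(X_j, Y_{\sigma(1)},\ldots,Y_{\sigma(k)}); \tag{51}$$

$$\hat\theta_y^{i'}(k) := \frac{1}{n_x\,|S^i_{k-1,n_y}|} \sum_{j=1}^{n_x} \sum_{\sigma \in S^i_{k-1,n_y}} h(X_j, Y_i, Y_{\sigma(1)},\ldots,Y_{\sigma(k-1)}). \tag{52}$$

Then, for each $k \ge 1$ and $\epsilon > 0$,

$$\lim_{n_x,n_y\to\infty} \mathbb{P}\left[\left|\frac{1}{n_x}\sum_{i=1}^{n_x} \big(\hat\theta_x^{i}(k) - \hat\theta_x^{i'}(k)\big)^2 - \xi_{1,0}(k)\right| > \epsilon\right] = 0; \tag{53}$$

$$\lim_{n_x,n_y\to\infty} \mathbb{P}\left[\left|\frac{1}{n_y}\sum_{i=1}^{n_y} \big(\hat\theta_y^{i}(k) - \hat\theta_y^{i'}(k)\big)^2 - \xi_{0,1}(k)\right| > \epsilon\right] = 0. \tag{54}$$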



Proof. Fix $k \ge 1$. We first use a result by Sen [24] to show that, for each $i \ge 1$:

$$\lim_{n_x,n_y\to\infty} \hat\theta_x^{i}(k) = \theta(k); \tag{55}$$

$$\lim_{n_x,n_y\to\infty} \hat\theta_x^{i'}(k) = \mathbb{E}(h(X_i,Y_1,\ldots,Y_k) \mid X_i); \tag{56}$$

$$\lim_{n_x,n_y\to\infty} \hat\theta_y^{i}(k) = \theta(k); \tag{57}$$

$$\lim_{n_x,n_y\to\infty} \hat\theta_y^{i'}(k) = \mathbb{E}(h(X_1,Y_i,\ldots,Y_{i+k-1}) \mid Y_i); \tag{58}$$

in an almost sure sense. Indeed, assume without loss of generality that $i = 1$. As the kernel functions found in (49) and (51) are bounded, the hypotheses of Theorem 1 in [24] are satisfied, from which (55) and (57) are immediate. Similarly, because $X_1$ and $Y_1$ are discrete random variables, (56) and (58) also follow from [24].
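Define

$$U(k) := \frac{1}{n_x}\sum_{i=1}^{n_x} \big(\hat\theta_x^{i}(k) - \hat\theta_x^{i'}(k)\big)^2; \tag{59}$$

$$V(k) := \frac{1}{n_y}\sum_{i=1}^{n_y} \big(\hat\theta_y^{i}(k) - \hat\theta_y^{i'}(k)\big)^2; \tag{60}$$

and observe that

$$\mathbb{V}(U(k)) = \frac{\mathbb{V}\big((\hat\theta_x^{1}(k) - \hat\theta_x^{1'}(k))^2\big)}{n_x} + \frac{n_x - 1}{n_x}\,\mathrm{Cov}\big((\hat\theta_x^{1}(k) - \hat\theta_x^{1'}(k))^2, (\hat\theta_x^{2}(k) - \hat\theta_x^{2'}(k))^2\big);$$

$$\mathbb{V}(V(k)) = \frac{\mathbb{V}\big((\hat\theta_y^{1}(k) - \hat\theta_y^{1'}(k))^2\big)}{n_y} + \frac{n_y - 1}{n_y}\,\mathrm{Cov}\big((\hat\theta_y^{1}(k) - \hat\theta_y^{1'}(k))^2, (\hat\theta_y^{2}(k) - \hat\theta_y^{2'}(k))^2\big).$$

Furthermore, due to (55)-(58), we have that

$$\lim_{n_x,n_y\to\infty} \big(\hat\theta_x^{i}(k) - \hat\theta_x^{i'}(k)\big)^2 = \big(\theta(k) - \mathbb{E}(h(X_i,Y_1,\ldots,Y_k) \mid X_i)\big)^2; \tag{61}$$

$$\lim_{n_x,n_y\to\infty} \big(\hat\theta_y^{i}(k) - \hat\theta_y^{i'}(k)\big)^2 = \big(\theta(k) - \mathbb{E}(h(X_1,Y_i,\ldots,Y_{i+k-1}) \mid Y_i)\big)^2. \tag{62}$$

But note that, for $i \ne j$, $(\theta(k) - \mathbb{E}(h(X_i,Y_1,\ldots,Y_k) \mid X_i))^2$ and $(\theta(k) - \mathbb{E}(h(X_j,Y_1,\ldots,Y_k) \mid X_j))^2$ are independent and hence uncorrelated. Similarly, the random variables $(\theta(k) - \mathbb{E}(h(X_1,Y_i,\ldots,Y_{i+k-1}) \mid Y_i))^2$ and $(\theta(k) - \mathbb{E}(h(X_1,Y_j,\ldots,Y_{j+k-1}) \mid Y_j))^2$ are independent. Since $|\hat\theta_x^{i}(k) - \hat\theta_x^{i'}(k)| \le 1$ and $|\hat\theta_y^{i}(k) - \hat\theta_y^{i'}(k)| \le 1$, it follows from (61) and (62), and the Bounded Convergence Theorem that

$$\mathbb{V}(U(k)) = o(1); \tag{63}$$

$$\mathbb{V}(V(k)) = o(1). \tag{64}$$

Finally, by (30) and (31) it follows that

$$\xi_{1,0}(k) = \mathbb{E}\big(\theta(k) - \mathbb{E}(h(X_1,Y_1,\ldots,Y_k) \mid X_1)\big)^2;$$

$$\xi_{0,1}(k) = \mathbb{E}\big(\theta(k) - \mathbb{E}(h(X_1,Y_1,\ldots,Y_k) \mid Y_1)\big)^2.$$

In particular, again by the Bounded Convergence Theorem, we have that $\lim_{n_x,n_y\to\infty} \mathbb{E}(U(k)) = \xi_{1,0}(k)$ and $\lim_{n_x,n_y\to\infty} \mathbb{E}(V(k)) = \xi_{0,1}(k)$. Since

$$U(k) - \xi_{1,0}(k) = \big(\mathbb{E}(U(k)) - \xi_{1,0}(k)\big) + \big(U(k) - \mathbb{E}(U(k))\big);$$

$$V(k) - \xi_{0,1}(k) = \big(\mathbb{E}(V(k)) - \xi_{0,1}(k)\big) + \big(V(k) - \mathbb{E}(V(k))\big);$$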



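the lemma is now a direct consequence of (63) and (64), and Theorem 1.5.4 of Durrett [22]. □

Proof of Theorem 4. Fix $k \ge 1$. Using (16) we have that

$$\hat\theta(k) = \left(1 - \frac{1}{n_x}\right)\hat\theta_x^{i}(k) + \frac{1}{n_x}\,\hat\theta_x^{i'}(k) = \left(1 - \frac{k}{n_y}\right)\hat\theta_y^{i}(k) + \frac{k}{n_y}\,\hat\theta_y^{i'}(k).$$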



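It follows by (28) and (29) that

$$S_x^2(k) = \frac{n_x - 1}{n_x^2}\,U(k); \tag{65}$$

$$S_y^2(k) = \frac{n_y - 1}{n_y^2}\,k^2\,V(k); \tag{66}$$

where $U(k)$ and $V(k)$ are as in (59) and (60), respectively. Furthermore, observe that

$$S^2(k) = S_x^2(k) + S_y^2(k).$$

In particular, due to (37), we obtain that

$$\left|\frac{S^2(k)}{\mathbb{V}(\hat\theta_p(k))} - 1\right| \le \frac{|U(k) - \xi_{1,0}(k)|}{\xi_{1,0}(k)} + \frac{|V(k) - \xi_{0,1}(k)|}{\xi_{0,1}(k)} + \frac{|U(k)|}{n_x\,\xi_{1,0}(k)} + \frac{|V(k)|}{n_y\,\xi_{0,1}(k)}.$$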



By Lemma 16, $U(k)$ converges in probability to $\xi_{1,0}(k)$, while similarly $V(k)$ converges in probability to $\xi_{0,1}(k)$; in particular, the first two terms on the right-hand side of the inequality converge to 0 in probability. Since $|U(k)| \le 1$ and $|V(k)| \le 1$, the same can be said about the last two terms of the inequality. Consequently, $S^2(k)/\mathbb{V}(\hat\theta_p(k))$ converges to 1 in probability, as $n_x,n_y \to \infty$. As stated in (39), however, conditions (a)-(c) imply that $\mathbb{V}(\hat\theta_p(k))$ and $\mathbb{V}(\hat\theta(k))$ are asymptotically equivalent as $n_x,n_y \to \infty$, from which the theorem follows. □
Figure Legends

Figure 1. Dissimilarity estimates. Heat map of $\hat\theta(n_y)$ sorted by site location metadata. Here, the x-axis denotes the sample from the environment corresponding to urn-x, and similarly for the y-axis. The entries on the diagonal are set to zero.

Figure 2. Error estimates. Heat map of $S(n_y)$ sorted by site location metadata. Here, the x-axis also denotes the sample from the environment corresponding to urn-x, and similarly for the y-axis, and the entries on the diagonal are again set to zero.

Figure 3. Discrete derivative estimates. Heat map of $|\hat\theta(n_y) - \hat\theta(n_y - 1)|$, sorted by site location metadata, following the same conventions as in the previous figures.

Figure 4. Sequential estimation. Plots of $\hat\theta(k)$, with $k = 1:n_y$, for three pairs of samples of the HMP data.

Supporting Information Legends

File S1. Summary Metadata related to Table 1 (tab-delimited text file).

File S2. OTU table related to Table 2 and Figures 1-4 (tab-delimited text file).

Tables

Table 1. HMP data. Summary of V35 16S rRNA data processed by Qiime into an OTU table.

Body Supersite            Body Subsite                    Assigned Labels
------------------------  ------------------------------  ---------------
Airways                   Anterior Nares                  1-5
                          Throat                          6-17
Gastrointestinal Tract    Stool                           18-47
Oral                      Attached/Keratinized Gingiva    48-59
                          Buccal Mucosa                   60-76
                          Hard Palate                     77-90
                          Palatine Tonsils                91-112
                          Saliva                          113-122
                          Subgingival Plaque              123-144
                          Supragingival Plaque            145-167
                          Tongue Dorsum                   168-191
Skin                      Left Antecubital Fossa          192-195
                          Left Retroauricular Crease      196-217
                          Right Antecubital Fossa         218-222
                          Right Retroauricular Crease     223-242
Urogenital Tract          Mid Vagina                      243-248
                          Posterior Fornix                249-259
                          Vaginal Introitus               260-266
Table 2. Sample comparisons. Summary of estimates for three pairs of samples of the HMP data.

Urn-x  Urn-y  n_x    n_y   θ̂(n_y)  θ̂(n_y) - θ̂(n_y - 1)  Regression Error  S(n_y)       
-----  -----  -----  ----  ------  --------------------  ----------------  -----------  ------
255    176    5054   6782  0.9998  0.0                   0.0               1.9892x10"*  0.0
200    139    12747  5739  0.0499  -1.6533x10"*          6.8306x10"'^      1.9286x10"^  0.9997
100    10     6206   8655  0.0324  -2.9416x10""          0.0438            2.2477x10"^  0.5130
